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Abstract 

The  paper  presents  a  generalized  regression  technique  centered  on  a  superquantile  (also  called 
conditional  value-at-risk)  that  is  consistent  with  that  coherent  measure  of  risk  and  yields  more 
conservatively  fitted  curves  than  classical  least-squares  and  quantile  regressions.  In  contrast  to 
other  generalized  regression  techniques  that  approximate  conditional  superquantiles  by  various 
combinations  of  conditional  quantiles,  we  directly  and  in  perfect  analog  to  classical  regression 
obtain  superquantile  regression  functions  as  optimal  solutions  of  certain  error  minimization 
problems.  We  show  the  existence  and  possible  uniqueness  of  regression  functions,  discuss  the 
stability  of  regression  functions  under  perturbations  and  approximation  of  the  underlying  data, 
and  propose  an  extension  of  the  coefficient  of  determination  R-squared  for  assessing  the  goodness 
of  fit.  The  paper  presents  two  numerical  methods  for  solving  the  error  minimization  problems 
and  illustrates  the  methodology  in  several  numerical  examples  in  the  areas  of  uncertainty  quan¬ 
tification,  reliability  engineering,  and  financial  risk  management. 


1  Introduction 

Analysts  and  decision  makers  are  often  concerned  with  a  random  variable  describing  possible  ‘cost,’ 
‘loss,’  or  ‘damage.’  The  interest  may  be  focused  on  a  single  ‘system’  or  could  involve  study  and  com¬ 
parison  across  a  multitude  of  systems  and  designs.  In  either  case,  it  may  be  beneficial  to  attempt  to 
approximate  such  a  loss  random  variable  Y  in  terms  of  an  n-dimensional  explanatory  random  vector 
X  that  is  more  accessible  in  some  sense.  This  situation  naturally  leads  to  least-squares  regression 
and  related  models  that  estimate  conditional  expectations.  While  such  models  are  adequate  in 
many  situations,  they  fall  short  in  contexts  where  a  decision  maker  is  risk  averse,  i.e. ,  is  more  con¬ 
cerned  about  upper-tail  realizations  of  Y  than  average  loss,  and  views  errors  asymmetrically  with 
underestimating  losses  being  more  detrimental  than  overestimating.  We  focus  on  such  contexts 
and  therefore  maintain  an  orientation  of  Y  that  implies  that  high  realizations  are  unfortunate  and 
low  realizations  are  favorable.  Of  course,  a  parallel  development  with  an  opposite  orientation  of 
the  random  variable  Y,  focused  on  profits  and  gains,  and  concerns  about  overestimating  instead  of 
underestimating  is  also  possible  but  not  pursued  here. 

Quantile  regression  (see  [16,  9]  and  references  therein)  accommodates  risk-averseness  and  an 
asymmetric  view  of  errors  by  estimating  conditional  quantiles  at  a  certain  probability  level  such  as 
those  in  the  tail  of  the  conditional  distribution  of  Y.  A  quantile  corresponds  to  ‘value-at-risk’  (VaR) 
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in  financial  terminology  and  relates  to  ‘failure  probability’  in  engineering  terms.  Quantile  regression 
informs  the  decision  maker  about  these  quantities  conditional  on  values  of  the  explanatory  random 
vector  X.  However,  a  quantile  is  not  a  coherent  measure  of  risk  in  the  sense  of  Artzner  et  al. 
[2]  (see  also  [7]);  it  fails  to  be  subadditive.  Consequently,  a  quantile  of  the  sum  of  two  random 
variables  may  exceed  the  sum  of  the  quantiles  of  each  random  variable  at  the  same  probability 
level,  which  runs  counter  to  our  understanding  of  what  ‘risk’  should  express.  Moreover,  quantiles 
cause  computational  challenges  when  incorporated  into  decision  optimization  problems  as  objective 
function,  failure  probability  constraint,  or  chance  constraint.  The  use  of  quantiles  and  the  closely 
related  failure  probabilities  is  therefore  problematic  in  risk-averse  decision  making;  see  [2,  25,  20,  21] 
for  a  detailed  discussion. 

A  superquantile  of  a  random  variable,  also  called  conditional  value-at-risk,  average  value-at- 
risk,  and  expected  shortfall1,  is  an  ‘average’  of  certain  quantiles  as  described  further  below.  It’s  a 
coherent  measure  of  risk  well  suited  for  risk-averse  decision  making  and  optimization;  see  [28]  for 
its  application  in  financial  engineering,  [13]  for  military  applications,  and  [20]  for  use  in  reliability 
engineering.  While  this  risk  measure  has  reached  prominence  in  risk-averse  optimization,  there  has 
been  much  less  work  on  regression  techniques  that  are  consistent  in  some  sense  with  it.  In  this 
paper,  we  derive  such  a  superquantile  regression  methodology,  study  its  properties,  and  propose 
means  to  assess  the  goodness-of-fit.  The  importance  of  such  a  regression  methodology  becomes 
apparent  by  considering  the  following  two  situations. 

Suppose  that  a  loss  is  given  by  a  random  variable  Y .  but  our  primary  concern  is  with  the 
conditional  loss  given  that  an  explanatory  random  vector  X  takes  on  specific  values.  We  aim  to 
select  these  values  judiciously  in  an  effort  to  minimize  the  conditional  loss.  We  denote  by  Y(x) 
the  conditional  random  variable  Y  given  that  X  =  x  €  Mn.  Of  course,  ‘minimizing’  Y (x)  is  not 
well-defined  and  a  standard  approach  is  to  minimize  a  risk  measure  of  Y(x)\  see  for  example  [21]. 
An  attractive  choice  is  to  use  a  superquantile  measure  of  risk,  which  as  mentioned  above  is  coherent 
and  also  computationally  approachable.  While  in  some  contexts  a  superquantile  of  Y (. x )  can  be 
evaluated  easily  for  any  x  €  Mn ,  there  are  numerous  situations,  especially  beyond  the  financial 
domain,  where  only  a  data  base  of  realizations  of  Y (x)  is  available  for  various  x.  In  the  latter 
situation,  there  is  a  need  for  building  an  approximating  model,  based  on  the  data,  for  the  relevant 
superquantile  of  Y (x)  as  a  function  of  x.  We  refer  to  this  as  superquantile  tracking.  In  comparison, 
if  the  goal  were  to  minimize  the  expectation  of  T(x),  then  least-squares  regression  would  yield  a 
model  that  approximates  that  conditional  expectation.  Likewise,  if  the  goal  were  to  minimize  a 
quantile  of  Y(x),  quantile  regression  would  provide  a  model  of  the  conditional  quantile.  While  these 
models  are  valuable  for  analysts  and  decision  makers  focused  on  the  expectation  and  quantile  risk 
measures,  they  don’t  provide  estimates  of  conditional  superquantiles.  In  essence,  the  same  need  for 
estimating  conditional  superquantiles  arises  in  reliability  engineering  when  the  goal  is  to  determine 
a  ‘design’  x  with  buffered  failure  probability  of  Y (x)  being  no  larger  than  a  given  probability  level, 
which  corresponds  to  a  constraint  on  a  superquantile  of  Y(x)  [20]. 

Another  situation  arises  when  the  explanatory  random  vector  X  is  beyond  our  direct  control, 
but  the  dependence  between  the  loss  random  variable  Y  and  X  makes  us  hopeful  that,  for  a  carefully 
selected  regression  function  f  :  lRn  — >•  M ,  the  random  variable  f{X)  may  serve  as  a  surrogate  for  Y. 
When  the  distribution  of  X  is  known,  at  least  approximately,  and  /  has  been  determined,  then  the 
distribution  of  f(X)  is  usually  easily  accessible.  That  distribution  may  then  serve  as  input  to  further 
analysis,  simulation,  and  optimization  in  place  of  the  unknown  distribution  of  Y .  Such  surrogate 
estimation  may  arise  in  numerous  contexts.  ‘Factor  models’  in  financial  investment  applications 
(see  for  example  [6,  15]),  where  Y  may  be  the  loss  associated  with  a  particular  asset  and  X  a 
vector  describing  a  small  number  of  macroeconomic  ‘factors,’  is  a  result  of  surrogate  estimation. 

1We  prefer  the  application-neutral  name  ‘superquantile’  when  deriving  methods  applicable  broadly. 
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‘Uncertainty  quantification’  (see  for  example  [17,  8])  considers  the  output  of  a  system  described  by 
a  random  variable  Y,  for  example  measuring  damage,  and  estimates  its  moments  and  distribution 
from  observed  realizations  as  well  as  knowledge  about  the  distribution  of  the  input  to  the  system 
characterized  by  a  random  vector  X.  A  main  approach  here  centers  on  surrogate  estimation  with 
f(X )  serving  as  an  estimate  of  Y.  In  this  situation,  an  essential  question  is  what  criterion  should 
be  used  for  selecting  /.  Clearly,  one  would  like  the  error  random  variable  Zf  :=  Y  —  f(X)  to  be 
small  in  some  sense.  However,  minimizing  the  mean-squared  error  of  Zf  would  not  reflect  a  greater 
concern  about  underestimating  Y,  i.e.,  underestimating  losses,  than  overestimating.  We  may  want 
to  assess  the  error  of  Zf  in  a  manner  that  is  ‘consistent’  with  our  use  of  a  superquantile  as  risk 
measure. 

In  this  paper,  we  develop  a  ‘generalized’  regression  technique  that  addresses  the  issue  of  su¬ 
perquantile  tracking  and  surrogate  estimation.  The  technique  is  an  extension  of  least-squares  and 
quantile  regression,  which  center  on  expectations  and  quantiles,  respectively,  to  one  that  focuses 
on  superquantiles. 

The  foundation  of  least-squares  and  quantile  regression  is  the  fact  that  mean  and  quantiles 
minimize  the  expectation  of  certain  convex  random  functions.  A  natural  extension  to  superquantile 
regression  could  then  possibly  involve  determining  a  random  function  that  when  minimizing  its 
expectation,  we  obtain  a  superquantile.  However,  such  a  random  function  doesn’t  exist  [10,  5], 
which  has  lead  to  studies  of  indirect  approaches  to  superquantile  regression  grounded  in  quantile 
regression. 

For  a  random  variable  with  a  continuous  cumulative  distribution  function,  a  superquantile 
equals  a  conditional  expectation  of  the  random  variable  given  realizations  no  lower  than  the  corre¬ 
sponding  quantile.  Utilizing  this  fact,  studies  have  developed  kernel-based  estimators  for  the  con¬ 
ditional  probability  density  functions,  which  are  then  integrated  and  inverted  to  obtain  estimators 
of  conditional  quantiles.  An  estimator  of  the  conditional  superquantile  is  then  finally  constructed 
by  integrating  the  density  estimator  over  the  interval  above  the  quantile  [26,  4]  or  forming  a  sample 
average  [14].  These  studies  also  include  asymptotic  analysis  of  the  resulting  estimators  under  a 
series  of  assumptions,  including  that  the  data  originates  from  certain  time  series. 

A  superquantile  of  a  random  variable  is  defined  in  terms  of  an  integral  of  corresponding  quan¬ 
tiles  with  respect  to  the  probability  level.  Since  the  integral  is  approximated  by  a  weighted  sum 
of  quantiles  across  different  probability  levels,  an  estimator  of  a  conditional  superquantile  emerges 
as  the  sum  of  conditional  quantiles  obtained  by  quantile  regression;  see  [18,  19],  which  also  show 
asymptotic  results  under  a  set  of  assumptions  including  the  continuous  differentiability  of  the  cumu¬ 
lative  distribution  function  of  the  conditional  random  variables.  Similarly,  [5]  utilizes  the  integral 
expression  for  a  superquantile,  but  observes  that  a  weighted  sum  of  quantiles  is  an  optimal  so¬ 
lution  of  a  certain  minimization  problem;  see  [21].  Analogously  to  the  situation  in  least-squares 
and  quantile  regression,  an  optimization  problem  therefore  yields  an  estimator  of  a  conditional  su¬ 
perquantile.  Though,  in  contrast  to  the  case  of  least-squares  and  quantile  regression,  the  estimator 
is  ‘biased’  due  to  the  error  induced  by  replacing  an  integral  by  a  finite  sum.  Under  a  linear  model 
assumption,  [5]  also  constructs  a  conditional  superquantile  estimator  using  an  appropriately  shifted 
least-squares  regression  curve  based  on  quantile  estimates  of  residuals.  In  both  cases,  asymptotic 
results  are  obtained  for  a  homoscedastic  linear  regression  model.  Under  the  same  model,  [27]  stud¬ 
ies  ‘constrained’  regression,  where  the  error  random  variable  Zf  =  Y  —  f(X)  is  minimized  in  some 
sense,  for  example  in  terms  of  least  square  or  absolute  deviation,  subject  to  a  constraint  that  limits 
a  superquantile  of  Zf.  While  this  approach  doesn’t  lead  to  superquantile  regression  in  the  sense 
we  derive  below,  it  highlights  the  need  for  alternative  techniques  for  regression  that  incorporate 
superquantiles  in  some  manner. 

The  need  for  moving  beyond  classical  regression  centered  on  conditional  expectations  is  therefor 
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now  well  recognized  and  has  driven  even  further  research  towards  estimating  conditional  distribution 
function,  i.e.,  Prob(Y (x)  <  y)  for  all  y  G  1R,  using  nonparametric  kernel  estimators  (see  for  example 
[11])  and  transformation  models  (see  for  example  [12]).  Of  course,  conditional  distribution  functions 
provide  the  ‘full’  information  about  Y{x)  including  its  quantiles  and  superquantiles,  and  therefor 
also  provide  a  means  to  inform  a  risk-averse  decision  maker.  In  this  paper,  however,  we  directly 
focus  on  superquantiles,  which  we  believe  deserve  special  attention  due  to  their  prominence  in  risk 
analysis. 

A  framework  for  ‘generalized’  regression  is  laid  out  in  [22,  21]  and  regression  functions  are 
obtained  as  optimal  solutions  of  optimization  problems  of  the  form  min  f  £(Zf),  where  £  is  a  measure 
of  error  and  /  is  restricted  to  a  certain  class  of  functions  such  as  the  affine  functions.  Least-squares 
regression  is  obtained  by  £{Zf )  =  E[Zj\,  quantile  regression  with  the  Koenker-Bassett  measure  of 
error,  but  many  other  possibilities  exist.  While  it  is  not  possible  to  determine  a  measure  of  error  that 
is  of  the  expectation  type  and  yields  a  superquantile,  in  this  paper  we  show  that  when  allowing  for 
a  broader  class  of  functionals,  a  measure  of  error  that  generates  a  superquantile  is  indeed  available. 
Such  a  measure  of  error  is  also  hinted  at  in  our  recent  paper  [24],  but  the  present  paper  gives  the  first 
comprehensive  treatment.  In  contrast  to  previous  studies  towards  superquantile  regression,  which 
utilize  indirect  approaches  and  quantile  regression,  we  here  offer  a  natural  extension  of  least-squares 
and  quantile  regression.  We  replace  the  mean-squares  and  Koenker-Bassett  error  measures  by  a  new 
error  measure,  and  then  simply  minimize  that  error  of  Zf  to  obtain  a  regression  function.  Under  few 
assumptions,  we  establish  the  existence  of  a  regression  function,  discuss  its  uniqueness,  and  examine 
stability  under  perturbations  of  the  distribution  of  ( X ,  Y)  for  example  caused  by  sampling.  We 
omit  a  discussion  of  simple  linear  models  with  independent  and  identically  distributed  (iid)  noise 
as  we  believe  that  there  is  little  need  for  quantile  and  superquantile  regression  in  such  contexts 
as  least-squares  regression  with  an  appropriate  shift  suffices.  In  fact,  we  don’t  separate  models 
into  (additive)  deterministic  and  stochastic  terms.  In  many  applications,  especially  in  the  area  of 
uncertainty  quantification,  heteroscedasticity  and  dependence  are  prevalent  making  linear  iid  and 
additive  models  of  little  value. 

Section  2  describes  measures  of  regret  and  error,  first  in  the  context  of  quantile  regression  and 
then  for  the  extension  to  superquantile  regression.  Section  3  defines  superquantile  regression  as  the 
minimization  of  a  measure  of  error,  discusses  existence  and  uniqueness  of  the  regression  function, 
and  provides  asymptotic  results.  Section  4  proposes  an  approach  for  assessing  the  goodness-of-fit 
of  regression  function  obtained  by  superquantile  regression.  Section  5  deals  with  computational 
methods  for  superquantile  regression  and  Section  6  gives  illustrative  examples. 


2  Quantiles,  Superquantiles,  and  Errors 

While  our  development  centers  on  superquantiles,  it  is  beneficial  to  maintain  a  parallel  description 
of  quantiles.  As  we  see  below,  quantile  regression,  which  is  achieved  by  minimizing  a  Koenker- 
Bassett  error  of  the  random  variable  Zf,  provides  a  road  map  for  the  construction  of  superquantile 
regression,  which  is  simply  achieved  by  minimizing  another  measure  of  error.  We  start,  however, 
with  definitions  of  quantiles,  superquantiles,  and  corresponding  measures  of  regret  and  error. 

2.1  Definitions 

For  a  G  [0, 1],  the  a-quantile  of  a  random  variable  Y  with  cumulative  distribution  function  Fy  is 
defined  as 

qa(Y)  :=  min {y  G  M  \  FY(y )  >  a}. 
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Its  quantiles  are  as  fundamental  to  Y  as  the  distribution  function,  but  are  problematic  to  incorporate 
in  risk  analysis  and  optimization  due  to  their  lack  of  coherency  as  well  as  computational  challenges. 
Superquantiles  have  more  favorable  properties.  For  a  €  [0,1),  the  a-superquantile  of  a  random 
variable  Y  is  defined  as 

1 

Q a{Y)  ■■=  j  Q/3  00  df3.  (1) 

a 

Since  a  superquantile  is  a  coherent  measure  of  risk  and  by  the  virtue  of  being  an  ‘average’  of 
quantiles  is  also  more  stable  than  a  quantile  in  some  sense,  it’s  well  suited  for  applications.  For 
a  =  1,  we  define  qa(Y)  :=  sup  Y  (the  essential  supremum).  Since  qo{Y)  =  E[Y],  we  therefore  focus 
on  a  €  (0, 1)  throughout  the  paper  to  avoid  distractions  by  these  special  cases. 

In  reliability  terminology,  quantiles  and  super  quantiles  correspond  to  failure  and  buffered  fail¬ 
ure  probabilities.  The  failure  probability  of  a  loss  random  variable  Y  is 

p(Y)  :=  Prob(T  >  0)  =  1  -  Fy(0), 


which  corresponds  to 

p(Y)  =  1  —  a  with  a  such  that  qa(Y )  =  0 

if  there  is  no  probability  atom  at  zero.  Analogously  to  the  latter  expression,  the  buffered  failure 
probability  (see  [20])  of  a  loss  random  variable  Y  is  defined  as 

p(Y)  :=  1  —  a  with  a  such  that  qa(Y )  =  0.  (2) 

A  requirement  that  p(Y)  <  1  —  a  is  therefore  equivalent  to  the  constraint  that  qa{Y)  <  0.  Conse¬ 
quently,  in  applications  with  a  buffered  failure  probability  constraint  on  a  (conditional)  loss  random 
variable  Y  (x)  as  well  as  when  the  goal  is  to  minimize  a  superquantile  of  Y  ( x )  directly,  there  are 
needs  for  estimating  qa(Y(x))  as  a  function  of  x  €  Mn .  Quantiles  and  superquantiles  are  connected 
through  a  trade-off  formula  that  leads  to  quantile  regression  as  discussed  next. 

2.2  Measures  of  Regret  and  Error  in  Quantile  Regression 

Both  a-quantiles  and  a-superquantiles,  a  €  [0,1),  of  a  loss  random  variable  Y  are  expressed  in 
terms  of  an  optimization  problem  involving  the  quantity 

Va(Y)  :=  — — E[max{y,0}],  (3) 

I  —  a 

which  is  a  measure  of  regret  that  quantifies  the  displeasure  with  realizations  of  Y  above  zero;  see 
[21].  Quantiles  and  superquantiles  then  follow  as 

qa{Y)  =  argrnin  {C0 +  Va{Y  -  Cq)}  (4) 

C0&R 

qa(Y)  =  min  {C0  +  Va(Y  -  C0)}  ,  (5) 

CbeR 

where  we  for  simplicity  assume  that  an  optimal  solution  is  unique.  In  general,  this  may  not  be  the 
case  and,  traditionally,  the  lowest  optimal  solution  has  been  defined  as  the  quantile. 

The  expression  for  qa(Y)  is  the  essential  building  block  for  quantile  regression,  but  since  we 
ultimately  would  like  to  go  beyond  the  class  of  constant  functions  as  candidates  for  a  regression 
function  we  need  to  pass  to  a  measure  of  error  £a  constructed  from  Va  by  setting 

Sa(Y)  :=  Va(Y)  -  E[Y } 
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for  any  loss  random  variable  Y  (with  E[|y|]  <  oo).  A  measure  of  error  quantifies  the  degree  of 
‘nonzeroness’  in  a  random  variable;  see  [21].  Direct  application  of  this  definition  and  a  recognition 
that  a  constant  term  in  an  objective  function  is  immaterial  with  respect  to  the  optimal  solution 
gives  that 

qa(Y)  =  argmin  £Q.(F  -  Co),  (6) 

C0eR 

where  again  the  set  of  optimal  solutions  may  not  be  a  singleton  and 

1  \  a 

£a(Y)  = - MmaxjY,  0}]  —  E[Y]  =  E  - maxjY,  0}  +  maxi- Y,  0} 

1  —  a  \_1  —  a 

is  a  (scaled)  Koenker-Bassett  error  [16] .  Quantile  regression  centers  on  computing  this  argrnin  with 
“minimizing  the  error  of  Y  —  Co  over  Co  £  M"  replaced  by  “minimizing  the  error  of  Y  —  f(X)  over 
a  class  of  functions  /  :  Mn  —>■  ,  often  taken  to  be  the  affine  functions.  We  view  qa{Y)  as  the 

‘closest’  scalar  to  the  random  variable  Y  under  a  Koenker-Bassett  error. 

If  our  goal  simply  were  to  estimate  qa(Y )  of  a  loss  random  variable  Y  for  a  given  a  £  (0, 1),  the 
above  expressions  would  have  sufficed,  possibly  passing  to  an  empirical  distribution  given  by  a  sam¬ 
ple  if  Ey  is  unknown.  In  the  present  context,  however,  connections  with  the  underlying  explanatory 
random  vector  X  and  the  focus  on  the  ‘approximation’  of  Y  warrants  a  parallel  development  to 
that  of  quantile  regression  centered  on  a  superquantile.  In  view  of  the  above  review  of  quantile 
regression,  it’s  clear  that  superquantile  regression  will  involve  the  minimization  of  some  measure  of 
error  that  returns  the  superquantile  as  argrnin2.  The  next  subsection  develops  such  a  measure  by 
first  constructing  a  corresponding  measure  of  regret. 

2.3  Measures  of  Regret  and  Error  in  Superquantile  Regression 

We  start  this  subsection  by  establishing  the  finiteness  of  a  superquantile  under  the  assumption 
that  the  loss  random  variable  Y  has  a  finite  second  moment  and  write  Y  £  £2(D)  :=  {Y  :  Q  —> 
R  |  E[Y2}  <  oo}. 

We  know  from  [21]  that  qa  is  a  convex,  positively  homogenous,  monotonic,  and  averse3  func¬ 
tional  on  C2(fl)  for  a  €  (0, 1).  A  superquantile  is  also  bounded  by  [24,  Theorem  3],  which  we  repeat 
here  with  a  different  proof.  We  adopt  the  notation  jn(Y)  =  E[Y]  and  er2(K)  =  E[(Y  —  /a(Y))2]. 

Proposition  1  For  Y  £  £2(D)  and  a  £  (0, 1)  one  has  that 

qa(Y)  <  f*(Y)  +  -=L= a(Y ).  (7) 

V 1  —  a 

Proof:  Suppose  that  the  quantile  qa(Y),  viewed  as  a  function  of  the  probability  level,  is  continuous 
at  a.  Let  Ia  be  the  indicator  function  of  the  interval  [ga(K),oo)  with  probability  1  —  a.  We  then 
have  by  the  Schwartz  inequality  that 

(1  -  a)qa(Y  -  m(K))  =  E[(Y  -  ii(Y))Ia\  <  ^  E[(Y  -  ^Y))2]^!2 1  =  <Y)Vl^. 

Then,  since  qa(Y  —  n(Y))  =  qa{Y)  —  n(Y),  the  result  follows  from  dividing  by  1  —  a.  Thus,  (7) 
is  valid  under  the  continuity  assumption  about  the  quantile,  which  is  true  for  all  but  at  most 
countably  many  a.  By  continuity  of  both  sides  of  (7)  with  respect  to  a,  it  must  then  hold  for  all 

2  Classical  least-squares  regression  can  be  viewed  similarly  as  returning  a  (conditional)  expectation  as  argmin  when 
minimizing  mean-square  measure  of  error,  i.e. ,  E[Y]  =  argmin c0eR  E[(Y  —  Co)2]. 

3 We  recall  that  a  functional  T  :  C2(fl)  — >  1R  =  R  U  {—00,00}  is  averse  if  E(X)  >  E\X]  for  all  nonconstant 
XgjC.2{Q). 


6 


(i  €  (0,1). 


u 


The  measure  of  regret  that  serves  in  the  context  of  superquantile  regression  is  defined  for  any 
loss  random  variable  Y  and  a  €  (0, 1)  as 

V«c n  :=  Vo(T),  (8) 

1  —  a 

where 

i 

V°(Y)  :=  J  max{0,qp(Y)}dp.  (9) 

o 

These  expressions  appear  in  [24],  which  also  explains  their  discovery.  Here,  we  provide  an  alter¬ 
native,  direct  proof  of  how  they  lead  to  a  superquantile.  We  start,  however,  with  two  preliminary 
results  and  the  definition  of  a  corresponding  error  measure. 

Lemma  1  For  Y  €  £2(fl), 


Vo 00  <  o(Y)  +  max{0,  n{Y)  +  a (T)}. 


(10) 


Proof:  From  (7)  and  (9)  we  have 


Vo 00  <  J  m&x{0,9Y((3)}d(3  for  9y{P)  =  n(Y)  +  ^l_=a(Y).  (11) 

o 


We  consider  three  cases.  In  Case  1,  we  suppose  that  9y(P)  >  0  for  all  P  €  [0, 1].  Then  the  right 
hand  side  of  (11)  is  given  by 


I  9Y(p)dp 
0 


1  1 

H(Y)  +  a{Y)  J( l-  /3)-^2dp  with  J (1  -  =  2. 

o  o 


(12) 


Therefore,  Vo(^0  <  / i(Y )  +  2a(Y)  in  Case  1.  In  Case  2a,  we  suppose  that  9y(P)  <  0  for  all 
P  €  (0, 1).  Then  obviously  Vq(Y)  <  0.  Finally,  in  Case  2b,  let  9y(P)  <  0  for  some  P  €  (0, 1),  but 
not  all.  Then  necessarily  cr(Y’)  >  0  and  n(Y)  <  —a(Y),  and  0y(P)  strictly  increases  with  respect 
to  p.  Let  a  be  the  unique  P  €  (0, 1)  with  0y(a)  =  0,  namely  when 


\/l  —  a  = 


*00 

-f(yY 


(13) 


Then  we  have  that 
l 

J  max{ 0, 6y(p)}dp 
o 


i  i 

J  0y(p)dp  =  (1  -  a)fi(Y)  +  a(Y)  J (1  -  p)~^dp 

Oi  a. 


(1  -  a)n(Y)  +  2a(Y)^/l^ 


a^ll(Y)  +  2  a(Y) 


yYf 

*(Y? 

-V(Y) 


<  <r(Y). 


Y) 

~»{Y) 
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Thus,  in  Case  2b  we  get  Vo 00  <  cr(Y’).  The  conclusion  then  follows  by  putting  together  the  three 
cases.  □ 


We  observe  that  for  a  G  (0, 1),  VQ  is  a  convex,  positively  homogeneous,  monotonic,  and  averse 
functional  on  C2(fl),  which  follows  from  the  properties  of  the  superquantile  [21],  and  by  the  above 
result  it  is  also  finite,  and  consequently  continuous.  A  corresponding  measure  of  error  is  defined 
for  Y  g  C2(n)  by 

£a(Y)  :=  Va{Y)  -  E[Y]  (14) 

and  referred  to  as  a  superquantile  error.  Obviously,  £a  is  also  convex  and  positively  homogeneous. 
It  also  satisfies  the  following  properties. 

Proposition  2  For  any  a  €  (0, 1)  and  Y  e  £2(fl),  a  superquantile  error  satisfies 

(a)  £a{Y)  =  0  when  Y  =  0, 

(b)  £a{Y)  >  0  when  Y  ^  0,  and 

(c)  £a(Y)  >  min{l, a/ (1  -  a)}|£[y]|. 


Proof:  Since  qp{ 0)  =  0  for  all  fi  G  [0, 1],  (a)  follows  trivially. 

Since  Va  is  averse,  we  have  that  for  Y  G  £2(fl),  £a(Y)  =  Va{Y)  —  E\Y]  >  E[Y]  —  E[Y ]  =  0  when 
Y  is  not  a  constant.  To  complete  part  (b),  we  therefore  only  need  to  consider  nonzero  constants. 
If  Y  is  a  positive  constant  K,  then 


f  max{0,  qp(Y)}dfi  -  E[Y]  >  [  max{0,  qp(Y)}dp  -  E[Y]  =  K  -  E[Y]  =  0. 
1  ~  a  Jo  Jo 


If  Y  is  a  negative  constant  K,  then 


1 


1  —  a 


max{0,  qp(Y)}d/3  —  E[Y\  = 


1  —  a 


max{0,  K}dfi  -  E[Y]  =  0  -  E[Y]  >  0, 


which  completes  part  (b). 

Since  qp(Y)  >  E[Y ]  for  all  ft  G  [0, 1],  we  have  whenever  E[Y]  >  0  the  bound 


7^—  I'  max{0,  qp{Y)}dfi  -  E[Y]  >  — —  /  '  max{0,  E[Y}}dfi  -  E[Y]  =  ~^—E[Y}. 
1  -  a  Jo  1- a  J0  1 -  a 

When  E[Y]  <  0, 

f  max{0,  qp{Y)}dfi  -  E[Y]  >  1  [  rnaxlQ,  E[Y]}dfi  -  E[Y]  =  —E[Y}. 

1  -  a  Jo  l- a  Jo 

Part  (c)  then  follows  by  combining  the  two  results. 


In  view  of  Proposition  2  and  the  above  discussion,  £a  is  a  regular  measure  of  error  in  the  sense 
of  [21], 

We  are  now  ready  to  show  that  a  superquantile  is  a  unique  optimal  solution  of  optimization 
problems  involving  Va  and  £a.  As  mentioned,  the  connection  between  a  superquantile  and  Va  is 
also  reached  in  Theorem  7  of  [24]  through  different  means.  The  direct  proof  in  the  present  paper 
and  the  connection  with  a  superquantile  error  are  new. 

Theorem  1  (Superquantile  as  optimal  solution)  For  Y  G  £2(fl)  and  a  G  (0, 1), 

qa(Y)  =  argmin{Co  +  Va(Y  -  Co)}  =  argmiri £a(Y  -  Co). 

Co  £=  R  Co  G  R 


(15) 


Proof:  Let  tp(C)  =  C  +  Va(Y  —  C)  and  'ipp(C)  =  max{0,  qp{Y)  —  C}.  These  are  both  convex 
functions  of  C,  and  ifa  is  nonincreasing.  We  can  use  the  criterion  that 

C  G  argmin </?(C)  tp'AC)  >  0,  <pr_(C )  <  0, 

c 


where,  because  of  the  monotonicity  of  ^g, 


^c)  =  1  +  y^  J(^y.(c)df5, 

o 


m+v) 


-1  if  gp(Y)  >  C, 
0  if  qp(Y)  <  C, 


<P-(C)  =  1  +  Y^  j(^)'+(C)d/3, 
o 


W-(C) 


-1  if  qp(Y)>C, 
0  if  Qp(Y)  <  C. 


Therefore 


i 


I )+(C)dP  =  J (W-(C)dt 3  =  -(1  -  7)  for  C  =  q^Y), 

o  o 

in  which  case  {%l)p)'{C)  =  (' ipp)'+{C )  =  (V’/3)/_(C')  =  1  —  (1  —  7)/(l  —  a).  Thus,  (V’g )'(C)  =  0  corre¬ 
sponds  to  C  =  q~({Y)  for  7  =  a.  Consequently,  the  first  equality  of  the  theorem  holds.  The  second 
follows  directly  from  (14)  and  the  fact  that  a  constant  in  an  objective  function  is  immaterial  with 
regard  to  the  argmin.  □ 


Being  analogous  to  (4)  and  (6),  the  foundations  for  quantile  regression,  the  expressions  (15) 
provide  the  path  to  superquantile  regression  as  developed  in  the  remainder  of  the  paper.  In  fact, 
Theorem  1  shows  that  qa(Y )  is  the  uniquely  ‘closest’  scalar  to  Y  in  the  sense  of  the  superquantile 
error. 

While  not  the  focus  here,  the  optimal  objective  function  value  in  (15)  defines  a  measure  of  risk 
(see  [24]) 

Ua(Y)  :=  min  {C0  +  Va(T  -  C0)}  =  qa(Y)  +  Va(Y  -  qa(Y)) 

CoGR 

for  Y  G  C2(Q)  analogously  to  qa(Y )  in  (5).  A  corresponding  measure  of  deviation,  which  quantifies 
the  nonconstancy  in  a  random  variable,  is  given  by 

Va(Y)  :=  min  £a(Y  -  C0)  =  Ka{Y)  -  E[Y]. 

CoG-R 

We  note  that  parallel  to  (1)  (see  [24]),  1Za(Y)  =  1/(1  —  a)  qp(Y)d/3  and,  consequently, 

Va(Y)  =  — [l  qp(Y)d/3  -  E[Y}. 
l-«ia 

The  measures  of  regret,  error,  risk,  and  deviation  Va,  £<*>  ^a,  and  a  €  (0, 1),  form  a  family  of 
risk  quadrangles  in  the  sense  of  [21]  that  corresponds  to  the  statistic  qa.  The  measure  of  deviation 
T>a  plays  a  central  role  in  the  remainder  of  the  paper  as  it  facilitates  simplifications,  goodness-of-ht 
tests,  and  computational  methods. 


3  Superquantile  Regression 

Theorem  1  and  the  development  leading  to  quantile  regression  direct  us  to  a  new  regression  method¬ 
ology  that  is  centered  on  a  superquantile  error.  The  next  subsection  poses  the  regression  problem, 
provides  its  properties,  and  discusses  stability  under  perturbations.  The  section  ends  with  a  dis¬ 
cussion  of  superquantile  tracking. 
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3.1  Regression  Problem 

While  Theorem  1  shows  that  the  ‘best’  scalar  approximation  of  a  random  variable  Y  in  the  sense 
of  a  superquantile  error  is  the  corresponding  superquantile,  we  now  go  beyond  the  class  of  constant 
functions  to  utilize  the  connection  with  an  underlying  explanatory  random  vector  X.  We  focus  on 
regression  functions  of  the  form 

f(x)  =  C0  +  (C,  h(x)),  C0  e  1R,C  e  Mm, 

for  a  given  ‘basis’  function  h  :  Mn  — >•  Mm.  This  class  satisfies  most  practical  needs  including  that 
of  linear  regression  where  m  =  n  and  h(x)  =  x.  Extensions  beyond  this  class  are  also  possible  but 
not  dealt  with  here. 

For  any  h  :  Mn  — >•  lRm  and  a  G  (0, 1),  we  define  the  superquantile  regression  problem 

P  :  min  Sa(Z(C0,C)), 

C0eR,CeRm 

where 

Z(C0,C)  :=Y-(C0  +  (C,h(X))) 

is  the  error  random  variable ,  whose  distribution  depends  on  Co,  C,  h,  and  the  joint  distribution  of 
(X,Y).  We  denote  by  C  C  Km+ 1  the  set  of  optimal  solutions  of  P  and  refer  to  (Co,  C)  €  C  as  a 
regression  vector. 

The  objective  function  £a(Z(-,  •))  is  well-defined  and  finite  when  the  distribution  of  (X,  Y)  and 
h  is  such  that  Z(Cq,C)  G  £2(P)  for  all  Co  G  M,  C  G  JRm .  A  sufficient  condition  that  ensures  this 
property  is  that  Y,  h\(X), ...,  hm(X )  G  £2(P)  as  shown  next,  where  we  adopt  the  notation 

H  =  h(X),  Hi  =  hi(X),  i  =  1 . 2. ....  m. 

Lemma  2  IfY,  Hu  ...,  Hm  G  £2(P),  then  Z(C0,  C)  G  £2(P)  for  all  C0  G  JR,  C  G  Mm . 

Proof:  Let  M  <  oo  be  such  that  E[Y2]  <  M  and  E[Hf]  <  M,  i  =  1,2 Since  |(C, H)\  < 
\\C\\TT=i  \Hi\ and  (C,  H)2  <  ||C||2  E  (^)2,  we  find  that  E[\  <C,  H)\]  <  ||C||mM  and  E[(C,  H)2]  < 
||C||2mM.  Consequently, 

E[(Y  -C0-  (C,  H))2}  <  E[(Y-Co)2}  +  2\E[(Y-Co)(C,H)}\  +  E[(C,H)2}  (16) 

<  M  +  2(||C||?n1/2Af  +  (Af  +  \C0\)\\C\\mM)  +  ||C||2mAL 

□ 

In  surrogate  estimation,  Co+(C,  h(X)},  with  (Co,  C)  G  C,  provides  the  best  approximation  of  Y 
in  the  sense  of  a  superquantile  error.  For  example,  after  having  computed  (Co,  C),  the  analysis  could 
proceed  with  examining  the  moments,  quantiles,  and  superquantiles  of  Co  +  (C,  h{X))  as  surrogates 
for  the  corresponding  quantities  of  Y .  If  X  is  Gaussian  and  h  is  affine,  then  Co  +  (C,  h(X)}  is  a 
Gaussian  approximation  of  Y  easily  examined  and  utilized  in  further  studies.  It  may  also  be  of 
interest  to  examine  Co  +  (C,  h(X)}  under  hypothetical  distributions  of  X. 

A  direct  consequence  of  the  Regression  Theorem  in  [21]  (see  also  Theorem  3.1  in  [22])  we 
obtain  that  a  regression  vector  can  equivalently  be  determined  from  a  measure  of  deviation  T)a . 

Proposition  3  Suppose  that  Y,H i, ...,  Hm  G  £2(fl).  Then,  the  set  of  regression  vectors  C  of  P  is 
equivalently  obtained  as 

C={  (Co,  C)  G  Mm+1  |  C  G  argmin  Va  (Z0  (C) ) ,  C0  =  ga(Z0(C))j  , 
l  CeRm  J 

where  Zq(C)  :=  Y  —  (C,h(X)). 
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Proposition  3  implies  computational  advantages  as  the  (m  +  l)-dimensional  optimization  problem 
P  is  replaced  by  a  problem  in  m  dimensions  with  a  simpler  objective  function,  which  we  fully  utilize 
in  Sections  5  and  6.  Moreover,  the  result  also  proves  beneficial  in  analysis  of  regression  vectors. 

The  existence  of  a  regression  vector  is  ensured  by  the  next  result,  which  also  provides  conditions 
for  uniqueness. 

Theorem  2  (Existence  and  uniqueness  of  regression  vector)  If  Y,  Hi, ....  Hm  €  C2(IX),  then  P  is  a 
convex  problem  with  a  set  of  optimal  solutions  C  that  is  nonempty,  closed,  and  convex. 

(a)  C  is  bounded  if  and  only  if  the  random  vector  X  and  the  basis  function  h  satisfy  the 
condition  that  ( C,h(X ))  is  not  constant  unless  C  =  0. 

(b)  If  in  addition,  for  every  (Co,C),  ( C'0,C ')  €  Rm+1 ,  with  C  /  C' ,  there  exists  a  /3o  £  [0, 1) 
such  that 

0  <  qp(Z(C0,  C )  +  Z(C’0,  C'))  <  qp(Z(C0,  C ))  +  qg(Z(C0,  C'))  (17) 

for  all  /3  €  \J3o,  1),  then  C  is  a  singleton. 

Proof.  Since  Y  €  C2(H)  implies  that  £a(Y)  <  oo  by  Lemma  1,  we  deduce  the  two  first  conclusions 
from  Theorem  3.1  in  [22].  Hence,  we  only  need  to  show  that  C  is  a  singleton. 

Suppose  for  the  sake  of  a  contradiction  that  (Co,  C),  (C'0,  C)  €  C  and  (Co,  C)  /  ( C'0 ,  C),  with 
corresponding  optimal  value  £  >  0,  i.e.,  £  =  £a(Z(Co,  C))  =  £a(Z(C'0l  C')).  We  consider  two  cases. 
First,  suppose  that  £  =  0.  By  Proposition  2,  Z(Cq,C)  =  Z(C'0,C')  =  0  and  consequently 

C0  +  (C,  H)  =  Cq  +  (Cr,  H), 

which  implies  that  (C  —  C ,  H)  =  C(  —  Cq.  Under  the  assumption  that  (C,  h(X))  is  only  constant 
when  C  =  0,  we  must  have  that  C  —  C'  =  0.  Then,  also  C0  —  Co  =  0  follows,  which  contradicts  the 
hypothesis  that  (Co,C)  /  (C'0,C). 

Second,  suppose  that  £  >  0.  If  C  =  C’ ,  then  a  direct  consequence  of  Proposition  3  and  the 
fact  that  every  random  variable  has  a  unique  superquantile  at  each  probability  level,  is  that  also 
Co  =  Cq,  which  again  contradicts  our  hypothesis.  Consequently,  we  focus  on  the  case  with  C  /  C' , 
for  which  there  exists  a  /3o  such  that  (17)  holds  for  all  f3  €  [/?o>  !)•  Trivially,  then 

max{0,  qp(Z{C0,  C)  +  Z(C'0 ,  C'))}  <  max{0,  qp(Z(C0,  C))}  +  max{0,  qp(Z(C'Q,  C'))} 

for  f3  €  [/3 o,  1)-  If  /I  €  (0, 1)  is  such  that  qp(Z(Co,  C)  +  Z(C'Q,  C'))  <  0,  then 

max{0,  qp(Z(C0,  C)  +  Z(C'0,  C'))}  <  nrax{0,  qp(Z(C0,  C))}  +  max{0,  qp(Z(C'0,  C'))} 

as  the  left-hand  side  vanishes  and  the  right-hand  side  is  nonnegative.  Hence, 

[  max{0,  qp(Z(Co,  C)+Z(C'0,  C'))}d/3  <  f  max{0,  qp(Z(C0,  C))}d/3+  [  max{0,  qp(Z(C'0,  C'))}d(i 
J 0  Jo  Jo 

and  also 

£a(Z(C0,  C)  +  Z(C’0 ,  C'))  <  £a(Z(Co,  C))  +  £a(Z(C'0,  C')).  (18) 

Let 

(C",C")  =  (1/2)(C0,C)  +  (1/2)(C',C') 

and  therefore 

2 Z(C'',  C")  =  Z(C0,  C)  +  Z(C'0 ,  C'). 

By  the  optimality  of  f,  the  positive  homogeneity  of  £a ,  and  (18),  we  find  that 

2^  <  2 £a(Z(C'',C"))  =  £a(2Z(C'',C"))  <  £a(Z(C0,C))  +  £a(Z(C',C'))  =  2^, 
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which  cannot  hold.  In  view  of  this  contradiction,  the  conclusion  follows. 


u 


While  Theorem  2  gives  a  sufficient  condition  for  uniqueness  of  the  regression  vector,  in  general 
uniqueness  cannot  be  expected.  For  example,  suppose  that  the  random  vector  ( X ,  Y),  with  X  scalar 
valued,  has  the  possible  and  equally  likely  realizations  (1, 1),  (2,2),  and  (3, 1).  Then,  qp(Zo(C))  = 
max{l  —  (7,  2  —  2 (7,  1  —  3(7}  for  (3  >  2/3  and  E[Z$(C)\  =  4/3  —  2(7.  It’s  straightforward  to  show 
that  for  a  >  2/3,  any  (7  €  [—1,1]  minimizes  Va(Zo(-)).  Consequently,  in  view  of  Proposition  3, 
any  (7  €  [—1, 1],  with  a  corresponding  Cq  =  max{l  —  (7,  2  —  2(7, 1  —  3(7},  minimizes  £a(Z(-,  •))  for 
a  >  2/3.  The  minimum  error  is  2/3. 

A  unique  regression  vector  is  indeed  achieved  in  the  normal  case  as  stated  next. 

Proposition  4  Suppose  that  (H,  Y )  is  normally  distributed  with  positive  definite  variance- covariance 
matrix.  Then,  C  is  a  singleton. 

Proof:  Let  E  be  the  variance-covariance  matrix  of  (H,  Y),  with  Cholesky  decomposition  E  =  LLT . 
For  any  f3  €  (0, 1)  and  (7  €  JRm,  Zq(C)  is  also  normal  with  mean  /jl{Zq{C))  =  (C,E[(H,Y)])  and 
variance  a2(Zo(C))  =  (C,  EC),  where  C  =  (—(7, 1).  Thus, 

qp(Z0(C))  =  MZo(C'))  +  kf,<r(Z0(C))  =  p{ZQ(C))  +  kp\\LTC\\, 

where  /c,.g  =  </>(< L^1(/3))/(l  —  /3),  with  <f  and  $  being  the  standard  normal  probability  density  and 
cumulative  distribution  functions,  respectively. 

For  <7,  C  €  Mm ,  with  (7  /  C' ,  there  is  no  constant  k  >  0  such  that  (—(7,1)  =  &;(— C',  1). 
Let  C  =  (—(7,1)  and  C'  =  (— C",l).  Since  E  is  positive  definite,  the  upper-triangular  matrix  LT 
is  unique  and  full  rank.  Consequently,  the  null  space  of  LT  contains  only  the  zero  vector  and 
Lt  (C  —  kC')  0  for  all  scalars  k  >  0.  Since  the  triangle  inequality  for  two  vectors  holds  strictly 
whenever  the  two  vectors  cannot  be  expressed  as  a  positive  multiple  of  each  other,  we  therefore 
find  that 

\\ltc  +  ltc'\\  <  \\ltc\\  +  \\ltc'\\. 

Now  suppose  for  the  sake  of  a  contradiction  that  (7,  C  €  Mm  both  minimize  T>a(Zo(-))  and 
attain  the  minimum  value  £  G  JR ,  but  C  7^  C .  Let 

C"  =  (l/2)(7  +  (l/2)(7', 

C"  =  (—(7",  1),  and  7a  =  //'  kpdf3/(l  —  a)  >  0.  Then, 

Pa(Z0((7"))  =  ^  qp(Z0(C"))d(3  -  ^[Z0((7")] 

=  m(^o(C"'))  +  7a||iTC"||  -  /x(Z0(C")) 

=  y||LTC'  +  LT(7'|| 

<  y(l|£T(?ll  +  Ill'll) 

=  ^  ^/i(Z0((7))  +  7q||Lt(7||  -  MZo(C')))  +  ^(m(Z0(C"))  +7a||JbTC'||  -  m(Z0(C'))) 

=  ^(Pa(Z0(C)))  +  ^(PQ(Z0(C'))) 

=  ^(£  +  £)  =  £. 
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However,  this  contradicts  the  optimality  of  C,  C'  and  we  reach  the  conclusion. 


u 


We  next  turn  to  consistency  and  stability  of  the  regression  vector.  Of  course,  the  joint  dis¬ 
tribution  of  (X,Y)  is  rarely  available  in  practice  and  one  may  need  to  pass  to  an  approximating 
empirical  distribution  generated  by  a  sample.  Moreover,  perturbations  of  the  ‘true’  distribution 
of  ( X ,  Y)  may  occur  due  to  measurement  errors  in  the  data  and  other  factors.  We  consider  these 
possibilities  and  let  {X u ,YU)  be  a  random  vector  whose  joint  distribution  approximates  that  of 
(X,Y)  in  some  sense.  For  example,  (XU,YU)  may  be  governed  by  the  empirical  distribution  gen¬ 
erated  by  an  independent  and  identically  distributed  sample  of  size  v  from  (X,Y).  Presumably,  as 
v  — >•  oo,  the  approximation  of  (X,Y)  by  ( XU,YU )  improves  as  stated  formally  below.  Regardless 
of  the  nature  of  (XU,YU),  we  define  the  approximate  error  random  variable 

Zu(Co,  C)  :=Y''-C0-(C,h(X1')), 

and  the  corresponding  approximate  superquantile  regression  problem 

Pv  :  min  £a  {ZU(C0,  C)) . 

C0eR,Cei?m 

The  next  result  shows  that  as  ( X v ,  Yv)  approximates  ( X ,  Y),  a  regression  vector  obtained  from  Pu 
approximates  one  from  P,  which  provides  the  justification  for  basing  a  regression  analysis  on  Pu. 
Below,  we  let  — >d  denote  convergence  in  distribution  and 

Hu  =  h(Xv )  and  Hf  =  K(XV ),  i  =  1,2,  ...in. 

Theorem  3  (Stability  of  regression  vector)  Suppose  that  (XU,YV),  u  =  1,2,...,  and  (X,Y)  are 
n  +  1-dimensional  random  vectors  such  that  (XU,YU)  — >• d  (X,Y)  and  that  the  basis  function  h  is 
continuous  except  possibly  on  a  subset  S  C  Mn  with  Prob(X  e  S)  =  0.  Moreover,  let  Hi,Y  €  £2( Q), 
sup uE[(Hf)2]  <  oo,  i  =  1,2,  ...,m,  and  s\\y>v  E[(YV)2)  <  oo. 

If  m,  is  a  sequence  of  optimal  solutions  of  Pv ,  with  a  €  (0, 1),  then  every  accumu¬ 

lation  point  of  that  sequence  is  a  regression  vector  of  P. 

Proof:  Let  (Cq,C)  €  JRm+1  be  arbitrary.  By  the  continuous  mapping  theorem  (see  for  example 
Theorem  29.2  [3]), 

ZV(C0,  C)  =  Yv  -  C0  -  ( C ,  h{Xv))  Z(C0,  C)  =  Y-C0-  (C,  h(X)). 

By  the  assumed  moment  conditions,  there  exists  a  constant  M  <  oo  that  bounds  from  above  the 
terms 

maxE[\Hi\],  ma xE[(Ht)2],  SnpE[\Hf\],  sup E[(Hf)2],  E[\Y\],  E[Y2],  suPH[|W|],  sup E[{YV)2]. 

1  1  is,i  is,i  v  v 

In  view  of  Lemma  2  and  its  proof,  we  deduce  that 

E[{YU  -C0-  { C , Hu))2}  <  M  +  2(\\C\\mlI2 M  +  (M  +  \C0\)\\C\\mM)  +  \\C\\2mM  (19) 

for  all  v.  Hence,  Zu(Cq,C )  is  uniformly  integrable  (for  fixed  Cq,C)  and 

E[Zu(Co,  C)\  E[Z(C0,  C)}  <  oo;  (20) 

see  [3],  Theorem  25.12  and  its  corollary. 
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By  [24,  Theorem  4] ,  a  sequence  of  random  variables  converges  in  distribution  to  a  random  vari¬ 
able  if  and  only  if  the  corresponding  a-superquantiles,  viewed  as  functions  of  the  probability  level  a , 
converge  uniformly  on  every  closed  subset  of  (0, 1).  Consequently,  qp(Zu(Co,C))  — >•  qp(Z(Co,C)) 
uniformly  in  /3  on  closed  subsets  of  (0, 1).  Moreover,  since  the  O-superquantile  coincides  with  the 
expectation,  (20)  implies  that  qo(Zl'(Co,  C ))  — >•  qo(Z(Co,  C ))  also  holds.  These  facts  and  the  obser¬ 
vation  that  the  superquantile  of  any  random  variable  is  continuous  and  nondecreasing  as  a  function 
of  the  probability  level,  ensure  that  for  any  e  >  0  and  S  G  (0, 1),  there  exists  an  integer  u(e,  5)  such 
that  for  all  v  >  v(e,  (5), 


sup  1 9/3  (Zv(Cq,  C))  —  qp  (Z(Co,  C))|  <  — — r. 

/3e[o,i-<5]  ^  — 


(21) 


Then, 


r*l — 5 


-1  -s 


nrax{0,  qp(Zu(Co,  C))}df3  —  /  max{0,qp(Z(Co,C))}d{i 


"1—5 


<  /  \Qp(Zu(Co,  C))  —  qp(Z(Co,  C))|  d/3 

Jo 

5  e  e 

£  L 

for  all  v  >  v{e,5).  Following  an  argument  similar  to  that  in  Lemma  1,  we  find  that 


(22) 

(23) 

(24) 


'1-5 


max{0,  qp(Z(C0,  C))}d/3  <  51'2a(Z(CQ,  C))  +  max{0,  5^(Z(C0,  C))  +  5l'2a(Z(C^  C))}.  (25) 


Moreover,  the  reasoning  that  lead  to  (19)  also  gives 

HZ(C0,C))\<M  +  \C0\  +  \\C\\mM. 


(26) 


These  facts  show  that  there  exists  a  positive  constant  M  <  oo  (which  depends  on  Co  and  C)  such 
that  \h(Z(Co,C))\,ct(Z(Co,C))  <  M.  Hence,  from  (25),  we  find  that 


'1-5 


max{0,  qp(Z(Co,  C))}d/3  <  3 Md1^2. 


Let  e  <  12 M  and  S£  =  (e/(12 M))2 .  Then,  3 M£e1/2  =  e/4  and 

[  max{0,  qp(Z(Co,  C))}d(3  < 

Jl-Se  4 

An  identical  result  holds  for  Zv(Cq,C).  Consequently,  for  all  v  >  u(e,5e), 


r»l  /*1 

max{0,  qp(ZL'(Co,  C))}d/3  —  /  max{0,  qp(Z(Co,  C))}d/3 


/  o 


'o 


< 


+ 


r»l— S€ 


*l-6e 


/  max{0,  qp(Zu (C0,  C))}d/3  -  /  max{0,  qp(Z(C0,  C))}d/3 
Jo  Jo 

f  max{0  ,qp(Zu(C0,C))}d(5  +  [  max{0  ,qp(Z(C0,C))}d/3 
'1  -5e  JlSe 


„  e  e  e 

-  2  +  4  +  4_e‘ 


(27) 


(28) 
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This  fact,  (20),  and  the  assumption  that  (Co,  C)  is  arbitrary,  imply  that  £a(Zu(-,  •))  — y  £a(Z(-,  •)) 
pointwise  on  Mm+1 .  Lemma  1  and  the  above  moment  assumptions  imply  that  £a(Zu(-,-))  and 
£a(Z(-,  •))  are  finite-valued  functions.  They  are  also  convex,  which  follows  directly  from  the  con¬ 
vexity  of  £a  on  £2(n)  and  the  affine  form  of  Zv  and  Z  as  functions  of  Co  and  C.  Consequently,  by 
Theorem  7.17  in  [23],  £a(Zu(-,  •))  epiconverges  to  £a(Z(-.  •)).  The  result  then  follows  from  Theorem 
7.31  in  [23],  □ 

When  the  approximating  problem  Pv  is  constructed  using  an  independent  identically  dis¬ 
tributed  sample  of  size  v  from  the  distribution  of  (X,  Y),  we  obtain  the  following  corollary  which 
follows  from  the  properties  of  the  empirical  distribution. 

Corollary  1  Suppose  that  the  basis  function  li  is  continuous  except  possibly  on  a  subset  S  C 
J?"  with  Prob(X  G  S)  =  0  and  that  Hl,  Y  G  C2(Q),  i  =  1,2,  Moreover,  let  (Xu ,YU) 

be  distributed  according  to  the  empirical  distribution  generated  by  an  independent  and  identically 
distributed  sample  of  size  v  from  the  distribution  of  (X,Y).  Then,  the  conclusion  of  Theorem  3 
holds. 

We  next  examine  the  rate  of  convergence  of  regression  vectors  obtained  from  the  approximate 
problem  Pu  to  those  of  P  corresponding  to  the  ‘true’  distribution.  Quantification  of  the  stability  of 
the  set  of  optimal  solutions  of  an  optimization  problem  under  perturbations  depends  on  a  ‘growth 
condition’  of  the  problem,  which  is  difficult  to  quantify  for  P;  see  [23,  Section  7J].  Consequently, 
we  focus  on  the  better  behaved  e-regression  vectors  of  P  defined  for  e  >  0  as 

Ce  :=  { (C0,e,  Ce)  G  JRm+1  £a(Z(C0,e,  C€))  <  min  £a(Z(C0,  C))  +  e  )  , 

^  CoER^CeR171  j 

with  an  analogous  definition  of  the  e-regression  vectors  of  Pv  denoted  by  Cf.  The  rate  with  which 
Cf  tends  to  Ce  depends,  naturally,  on  the  rate  with  which  (Xu,  Yu),  underlying  Pu ,  tends  to  (X,  Y ) 
of  P  in  some  sense.  Before  we  make  a  precise  statement,  we  introduce  a  convenient  notion  of 
distances  between  any  two  nonempty  sets  A,B  C  Mm+1.  For  p  >  0,  let 

dIp(A,  B )  :=  inf {77  >  0|A  n  plB  C  B  +  pIB,  B  n  pIB  C  A  +  gIB}, 

where  IB  is  the  Euclidean  ball  in  Mm+l  with  unit  radius  and  center  at  the  origin.  Roughly,  cffp(A,  B) 
is  the  smallest  amount  the  sets  need  to  be  ‘enlarged’  to  ensure  they  contain  the  other  one,  with  an 
exclusive  focus  on  points  no  further  from  the  origin  than  p.  This  restriction  facilitates  the  treatment 
of  unbounded  sets. 

As  we  see  next,  the  rate  of  convergence  is  directly  related  to  the  rate  with  which  the  random 
vector 

:=  (Hu  —  H,YU  —  Y), 

describing  the  approximation  error,  tends  to  zero. 

Theorem  4  (Rate  of  convergence  of  regression  vector)  Suppose  that  {XV,YU),  v  =  1,2,...,  and 
(X,Y)  are  n  +  1-dimensional  random  vectors  generating  Pv  and  P,  respectively.  Moreover,  let 
Hi,Y  G  £2(n),  sup uE[(Hf)2]  <  00,  i  =  1,2 and  sup^P[(W)2]  <  00.  Let  p$  >  0  be  such 
that  polB  nC  /  0  and  poIB  n Cu  /  0. 

Then,  for  p  >  po,  there  exist  positive  constants  k\ ,  k-2,  and  k 3  (dependent  on  p)  such  that  for 
any  e  >  0  and  v  =  1, 2, ..., 

dp(C:.C,)  <  (l  +  7)  [sill  Al]  (*1  max  (o,  log  |  +  kfj  +  *3||£[A1']|| 

whenever  P[||A^||]  >  0  and  dp(Cf,Ce)  =  0  otherwise. 
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Proof:  By  Theorem  3(a)  of  [24],  for  (3  €  [0, 1), 


\qf,(Z"(C„,C))  -  f„(Z(C„,C))|  <  -Z-E[\Z"(C0,C)  -  Z(C0,C)|] 

=  i4a£[l(c,A>')|] 

<  ^IICMIIAI],  (29) 

where  C  =  (— C,  1).  Then,  for  5  €  (0, 1), 

pi— 8  rl— S 

/  max{01q/3(Zu(C0,C))}df3  -  /  max{0,  qp(Z(C0,  C))}df3 

Jo  Jo 

<  f1  SmZ"(C0,C))-qp(Z(C0lC))\d/3  (30) 

Jo 

<  ||C||E[||A1]  £~‘  j^d/3  =  -||C||£[||A1]  log  ,5. 

Let  p  >  po  and  M  be  an  upper  bound  on  first  and  second  moments  of  \H.j\,  \ H'-' | ,  |T|,  and  \YV\  as 
in  the  proof  of  Theorem  3.  Then,  for  ||  (Co,  (7)11  <  p,  it  follows  by  (26)  that 

|p(Z(C0,C))|  <  M  +  p  +  pmM 


and  by  (16)  that 


cr(Z(Co,  C ))  <  ( M  +  2 (pm1/2M  +  (M  +  p)pmM )  +  p2mM )1/2, 

with  identical  bounds  for  \p(Zu(Co,  C))|  and  cr(Zu(Co ,  C1)).  Let  Mp  be  the  larger  of  the  two  previous 
right-hand  sides. 

By  (25),  analogously  to  (27),  we  have  that  for  ||(Cb,(7)||  <  p , 


/  nrax{0,  qp(Z(Co,  C))}d(3  <  3Mpd^2 
Jl—5 

and  similarly  with  Z{Cq,C)  replaced  by  Zv{Cq,C). 

We  also  find  that  for  ||  (Co,  (7)11  <  p, 

\E[Z^C0,C))  -E[Z(C0,C)]\  =  \(C,E[A^})\  <  ||C||||i5[AI']||  <  (1  +  p)\\E[A^\\. 
Then,  collecting  the  results  of  (30),  (31),  and  (32),  we  obtain  that  for  ||(C'o,C')||  <  p, 
\Sa{Zv[CQ,C))-Ea{Z{CQ,C))\ 


(31) 


(32) 


< 


< 


nrax{0,  qp(Zu(Co,  C))}d(3  —  /  max{0,  q^(Z(C0,  C))}d/3 


+ 


E{Z"(C0,C)\-E[Z(C0,C)\ 


hi— 8  pi— 8 

max{0,  qp(Zu(Co,  C))}d/3  —  /  max{0,  qp(Z(C0,  C))}df3 


+ 


+ 


' 1-5 


max{0,  qp(Zu(C0,  C))}d/3 


>1-5 


nrax{0,  qp(Z(C0 ,  C))}d(3 


E[zr{CQ,C)}-  E[Z(C0,C)} 


<  -(1  +  p)E[ II AHI]  log  5  +  6Mp61/2  +  (1  +  p) IIEIA" 


(33) 
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We  next  determine  the  choice  of  <5  €  (0, 1)  that  minimizes  the  previous  bound  and  consider  two 
cases.  First,  if 

o  <  m^IIAI])2  <  i, 

with 

V=(2(l  +  p)/(6Mp))2, 

then  differentiation  gives  that  the  bound  is  minimized  with  6  =  /cp(£'[|| A"||])2.  Second,  if 

M£[IIA1D2  >  l, 

then 

Mp<  4(1  +  p)£[||  Al]/6 

and  the  bound 

-(1  +  p)E[ || AH|]  log <5  +  6 M^2  +  (1  +  p)\\E[A"]\\ 

<  -(1  +  p)E\ || AH|]  log 5  +  4(1  +  p)E[\\ AH|]<51/2  +  (1  +  p)\\E[A^}\\ 

for  any  <5  €  (0,1).  Consequently,  combining  the  two  cases,  there  exist  constants  k\ ,  k2.  and  k% 
(which  depend  on  p),  such  that  for  ||(C'o,C')||  <  p, 

\£a{Zu(Co,  C))  —  £a(Z(Co,  C))| 

<  hE[\\A-\\\  max  jo,  log  (^pTjjJ )  }  +  k2E[\\ Av\\]  +  fe3||^[A-] || 

<  E[\\AU\\]  f&q  max  |o,log  }  +  faj  +  k3\\E[Au}\\. 

Direct  application  of  Example  7.62  and  Theorem  7.69  of  [23]  then  yields  the  conclusion  for  E[||  A"||]  > 
0,  where  the  additional  coefficient  (1  +  4 p/e)  originates  in  that  theorem.  Finally,  if  E[||Ai'||]  =  0, 
then,  in  view  of  (29)  and  the  fact  that  this  implies  that  ||E[A1']||  =  0,  we  find  that  for  ||(Co,  (7)11  <  p, 

I £a(Zv(C0,  C))  -  £a(Z(C0,  C))|  =  0. 

The  final  conclusion  then  follows  by  again  invoking  Example  7.62  and  Theorem  7.69  of  [23].  □ 

Theorem  4  shows  that  the  distance  between  C"  and  Ce  is  almost  proportional  to  £7[||  A"||], 
but  with  a  minor  correction  by  a  logarithmic  term.  If  the  approximation  (XV,YU)  is  caused  by 
measurement  errors  of  magnitude  1  /is,  i.e.,  the  absolute  value  of  each  component  of  (Xu—X,  Yu—Y) 
is  no  greater  than  \/v  almost  surely,  then  £’[||AI'||]  <  \Jm-\-  \/v  and  the  expressions  can  be 
simplified.  For  £  >  0,  log  a;  <  x ^  for  sufficiently  large  x  €  M.  Consequently,  for  any  £  £  (0, 1)  and 
sufficiently  large  v, 

<ffP(C,c«)<  ( i  +  ^)  4 

where  k  >  0  can  be  determined  from  k\ ,  k2,  k3,  and  m.  That  is,  the  Euclidean  distance  between  an 
e-regression  vector  of  Pv  to  one  of  P  is  for  £  £  (0, 1)  arbitrarily  close  to  zero. 
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3.2  Superquantile  Tracking 

We  next  turn  to  the  situation  where  we  seek  to  estimate  qa{Y (x))  for  x  €  JRn ,  or  a  subset  thereof, 
with  the  goal  of  eventually  minimizing,  at  least  approximately,  qa(Y (x))  by  a  judicious  choice  of  x. 
Of  course,  with  incomplete  knowledge  about  the  distributions  of  Y (x)  this  is  a  difficult  task  that 
can  be  achieved  only  approximately.  For  example,  there  is  no  guarantee  that  a  regression  function 
/  =  Co  +  (C,h(-)),  with  (Co,  C)  €  C  obtained  by  solving  P  using  a  €  (0, 1),  tracks  qa(Y(x)),  i.e., 
f(x)  =  qa(Y (x))  for  all  x  €  Mn .  The  hope  of  such  ‘exact’  tracking  becomes  even  less  realistic  when 
P  must  be  replaced  by  an  approximation  Pu  as  typically  required  in  practice.  However,  ‘local’ 
tracking  is  possible,  at  least  approximately,  with  an  appropriate  weighing  of  the  data  available  as 
we  discuss  next. 

We  consider  the  situation  where  there  is  a  sample  of  Y(x)  for  a  set  of  x,  but  the  sample  is  not 
large  enough  to  allow  pointwise  estimation  of  qa{Y{x ))  for  every  x  of  interest.  There  may  even  be 
no  x  for  which  there  are  multiple  samples  of  Y{x).  Concentrating  on  a  particular  x  €  Mn ,  we  hope 
to  estimate  qa(Y (x))  by  using  samples  from  Y (x)  for  x  near  x,  weighted  appropriately.  The  weights 
should  be  nonnegative,  sum  to  one,  and  can  be  thought  of  as  an  artificially  constructed  probability 
distribution  associated  with  the  sample.  Specifically,  suppose  that  Xi,i  =  are  the  points 

where  the  sample  is  observed  and  yi,  i  =  1, ...,  v,  are  the  corresponding  realizations  of  Y(xf).  When 
estimating  a  superquantile  at  x,  we  put  more  ‘trust’  on  sample  points  taken  near  x  and  consequently 
the  weight  of  (. Xi,yi )  may  be  inversely  proportional  to  || Xi  —  x||,  with  an  appropriate  adjustment  if 
x  coincides  with  an  x,> 

A  justification  for  the  approach  follows  directly  from  Theorem  3  through  the  next  proposition. 

Proposition  5  Suppose  that  the  assumptions  of  Theorem  3  hold  and  that  the  probability  distribu¬ 
tion  of  (X,Y)  is  degenerate  at  x  €  IRn+1  in  the  sense  that  Prob((X,Y )  <  (x,  y))  =  (p(y),  for  all 
y  €  M  and  x  >  x,  where  (p(y)  =  Prob(Y(x )  <  y),  and  Prob((X,Y)  <  (. x,y ))  =  0  otherwise.  If 
{(Cq  ,  C1')}^L1  is  a  sequence  of  optimal  solutions  of  Pv ,  with  a  €  (0, 1),  then  along  every  convergent 
subsequence  we  have  that  Cq  +  (Cu,h(x))  tends  to  qa(Y(x)). 

Proof.  For  the  given  degenerate  distribution  of  ( X,Y ),  Cq  +  ( C,h(X )}  =  Cq  +  (C,h(x))  al¬ 
most  surely.  Consequently,  P  reduces  to  the  error  minimization  problem  of  Theorem  1  and 
Co  +  ( C,h(x ))  =  qa{Y(x))  for  every  (Cq,C)  €  C.  The  conclusion  then  follows  from  Theorem 
3.  □ 

Suppose  that  the  weights  of  (xi,yf),  i  =  1,2,  ...,zq  in  the  above  construction  are  chosen  to 
approximate  the  degenerate  distribution  of  Proposition  5,  for  example  by  setting  them  inversely 
proportional  to  || xi  —  x||.  Then,  in  view  of  Proposition  5,  a  solution  of  Pu,  constructed  using 
those  weights  as  an  artificial  probability  distribution  for  (Xu,  Y u),  leads  to  an  approximation  of  the 
considered  superquantile  at  x.  Of  course,  this  procedure  can  be  repeated  for  different  points  x  to 
generate  a  ‘global’  assessment  of  qa(Y(x))  as  a  function  of  x  and  eventually  facilitate  optimization 
over  x.  Moreover,  the  process  can  be  repeated  with  new  or  augmented  sample  points  in  a  straight¬ 
forward  manner.  In  a  situation  where  a  sample  is  not  fully  randomly  generated  but  x-points  are 
determined  by  an  analyst,  the  approach  may  even  motivate  scattering  those  points  near  a  point  of 
interest  x  instead  of  concentrating  them  all  at  x  exactly.  The  former  approach  certainly  results  in 
a  better  ‘global’  understanding  of  a  superquantile  as  a  function  of  x,  but  may  prove  to  be  a  more 
economical  route  to  estimate  a  superquantile  at  x  too.  We  examine  this  situation  numerically  in 
Section  6. 
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4  Validation  Analysis 


Regression  modeling  must  be  associated  with  means  of  assessing  the  goodness-of-fit  of  a  computed 
regression  vector.  In  least-squares  regression,  the  coefficient  of  determination 

t->2  _  -1  _  l~>  ffites 
SST  ’ 

where  SSRes  denotes  the  residual  sum  of  squares  and  SSt  the  total  sum  of  squares,  provides  a 
means  for  such  an  assessment.  While  R 2  can’t  be  relied  on  exclusively,  it  provides  an  indication 
of  the  goodness  of  fit  that  is  easily  extended  to  the  present  context  of  superquantile  regression.  In 
our  notation, 

2  =  E[Z(Cq,C)2] 

R  ct2(T) 

and  similarly  when  passing  to  an  approximate  random  vector  ( XU,YU ).  From  Example  1’  in  [21], 
we  know  that  the  numerator  in  (34)  is  an  error  measure  applied  to  Z(Co,  C)  and  that  it  corresponds 
to  the  deviation  measure  cr2(-).  Moreover,  the  minimization  of  that  error  of  Z(Cq,C)  results  in 
the  least-squares  regression  vector.  According  to  [21],  these  error  and  deviation  measures  are  in 
correspondence  and  belong  to  a  ‘risk  quadrangle’  that  yields  the  expectation  as  its  statistic.  This 
observation  motivates  the  following  definition  of  a  coefficient  of  determination  for  superquantile 
regression  model. 


Definition  1  In  superquantile  regression,  the  coefficient  of  determination  of  a  regression  vector 
(C0,  C)  €  l?;m+1  is  given  by 

Rl(C0,C)  :=  1  -  (35) 

In  fact,  a  similar  definition  can  be  formulated  for  any  generalized  regression  consisting  of  minimizing 
an  error  of  Zf,  with  then  another  measure  of  error  in  the  numerator  and  a  corresponding  deviation 
measure,  in  the  sense  of  [21],  in  the  denominator.  As  in  the  classical  case,  higher  values  of  R2 
are  better,  at  least  in  some  sense.  However,  R2  <  1,  which  is  apparent  from  the  nonnegativity 
of  the  error  and  deviation  measures.  Indeed,  P  aims  to  minimize  the  error  of  Z(Cg,C)  by  wisely 
selecting  the  regression  vector  (Co,C)  and  thereby  also  maximizes  R2.  The  error  is  ‘normalized’ 
with  the  overall  ‘nonconstancy’  in  Y  as  measured  by  its  deviation  measure  to  more  easily  allow  for 
comparison  of  coefficients  of  determination  across  data  sets. 

It’s  possible  to  obtain  large  coefficients  of  determination  by  adding  explanatory  terms  to  a 
regression  model,  i.e.,  increasing  m,  but  without  necessarily  achieving  a  more  useful  model.  Hence, 
it’s  usual  in  least-squares  regression  to  also  evaluate  an  adjusted  coefficient  of  determination  that 
penalizes  any  term  added  to  the  model  that  doesn’t  reduce  variability  substantially.  This  quantity 
only  increases  if  a  new  term  reduces  SSRes/(is  —  m)  as  seen  by  the  definition 

2  _1  SSRes/{n-m) 

Adj  SST/(u  -  1) 

where  v  is  the  number  of  observations.  Naturally,  then,  we  define  an  adjusted  coefficient  of  deter¬ 
mination  for  superquantile  regression  similarly  in  the  case  where  the  distribution  of  ( X ,  Y )  has  a 
finite  support  of  cardinality  v. 


Definition  2  In  superquantile  regression,  the  adjusted  coefficient  of  determination  of  a  regression 
vector  (Cq,C)  €  J?m+ 1  is  given  by 


R: 


t, Adj  (Co,  C)  1  — 


£a(Z(C0,C))/(u-m) 

Va(Y)/(u-  1) 


(37) 


Again,  similar  expressions  are  available  for  other  generalized  regression  techniques. 
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5  Computational  Methods 

The  computational  task  of  carrying  out  superquantile  regression  consists  of  solving  the  convex  op¬ 
timization  problem  P,  or  in  practice  the  approximate  problem  Pu  due  to  incomplete  distributional 
information  and  other  sources  of  approximations.  In  this  section,  we  describe  convenient  means 
for  solving  Pu  when  (Xu ,YU)  has  a  discrete  joint  distribution  with  u  possible  realizations.  Re¬ 
gardless  of  the  distribution  of  {XV,YV),  a  reformulation  of  Pu  in  terms  of  the  deviation  measure 
T>a  is  beneficial.  In  view  of  Proposition  3,  the  task  of  determining  a  regression  vector  (Cq,Cu) 
reduces  to  that  of  minimizing  Pa(Zg(-)),  setting  Cu  equal  to  an  optimal  solution,  and  then  setting 
Cq  =  qa{ZQ  {Cu)).  Since  it’s  straightforward  to  compute  every  superquantile  of  a  random  variable 
with  a  discrete  probability  distribution,  we  focus  on  the  minimization  problem,  which  takes  the 
following  form  after  writing  out  the  expression  for  the  deviation  measure  in  this  case 

Dv:  min  -1-  ['  qp{Z^C))dp  -  E  [Z%{C)\  . 

OGji  -L  CU  J q. 

The  next  subsections  describe  two  computational  methods  for  solving  Du  when  the  distribution  of 
(XU,YU)  is  discrete. 


5.1  Analytical  Integration 

While  one  might  at  first  get  the  impression  that  numerical  integration  is  required  in  solving  Dv , 
this  may  not  actually  be  needed  as  shown  next.  Suppose  that  (Xv ,YU)  has  a  discrete  distribution 
with  support  (x-7,  y-7),  j  =  1,2 ,  ...,zq  and  Prob((Xi',  Yv)  =  (x-7,  y-7))  =  1/V  for  j  =  1,2,  This  is 
the  case  typically  encountered  in  applications,  where  (x-7,  y-7),  j  =  1,  2, ...,  v,  is  the  data  assumed  to 
be  equally  likely  to  occur.  We  then  obtain  significant  simplifications  in  Du . 

For  any  fixed  C  €  Mm,  the  cumulative  distribution  function  of  Zq(C)  is  a  piecewise  constant 
function  with  at  most  v  steps.  The  range  of  the  distribution  function  is  {0, 1/zq  2/zq ...,  1}  or  a 
subset  thereof.  By  partitioning  the  integral  over  f3  in  Du  according  to  this  range,  accounting  for 
the  fact  that  the  integral  starts  at  a,  the  problem  can  in  this  case  be  written  as 


min 

CeRm 


T^—it  r  qp(Z5(C))dp  -  E[Z5(C)], 

I  —  a  f —  jo  , 


(38) 


where  va  :=  \vot\,  with  [~a~|  being  the  smallest  integer  no  smaller  than  a  €  M,  Pua-i  =  and 
Pi  =  i/v.,  for  i  =  ua.  va  +  1, ...,  u.  In  view  of  (4)  and  (5), 

Qp(Zo(C))  =  mmRUp  +  l^E[max{ZZ(C)-Up,0}\  (39) 

=  qp(Z%(C))  +  -  qp{Z%(C)),  0}] 

for  each  /3  G  [0, 1).  However,  the  special  piecewise-constant  structure  of  the  cumulative  distribution 
function  of  Zq{C)  implies  that  qfj(Zp (C))  is  constant  as  a  function  of  ft  on  (/%_!,/%)  for  every 
i  =  va,va  +  Consequently,  Up,  ft  G  (a,  1)  in  (39)  can  be  replaced  by  a  finite  number  of 

variables  so  that  (38)  takes  the  form 


c1&  rb  Y  f  [’  S  (U‘  +  ^Elmax-KfC) 


Ui,  0}] 


d/3-E[ZP(C)]. 
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The  last  integral  simplifies  further  since  for  f5  G  (Pv-i,  /3V)  =  (1  —  1/u,  1) 


qp(Z%(C))  =  M(C)  :=  .  max  y3  -  (C,x3). 


Consequently,  (38)  takes  the  form 
1  X,1  rP> 

mm 

C&Rm  1  - 


E-  E  /  ’  ””  ( ui  +  - Ui' °>] )  h  +  - ^[^O(C')]- 

1  - «  hi- 1 aieR  V  1  -  P  J  v(l-a) 


The  order  of  minimization  is  immaterial  and  we  can  equivalently  consider 
1 


u~l  rfo 


nun 

CeRm,UeRl,~,'0‘ 


—  y 


Ur  +  -i—Elmax{ZZ(C)  -  Ui,  0}])  dp  +  M^C)  -  E[Z%(C)], 
I  —  p  )  zz(I  —  a) 


where  we  let  U  =  (UUa ,  UVa+ 1, Uv-\).  For  i  =  va,  va  +  1, v  —  1,  we  define 


a,-  :  = 


I  Pi- 1  1  -  P 

Using  this  notation,  (38)  simplifies  further  to 

v-\ 


dp  =  log(l  -  Pi- 1)  -  log(l  -  Pi). 


V —  1 


1  1 

min  - - V  (Pi  -  Pi-i)Ui  +  - -  V  £'[max{Zg  (C)  -  Ui,0}]c 

C&Rm,UeR,'-Va  1  —  a  1  —  a 


M(C) 

+  l  lA  )]- 

By  introducing  another  set  of  auxiliary  variables  and  using  the  standard  transcription  technique 
for  handling  max- functions,  we  reach  the  linear  program 


1Z—1  V 


D 


t P  .  nun 

c,u,v,w  1  - 


—  Y'iPi-Pi-iWi  +  1  TV«,D, 

-  a  ^  v(l  —  a)  ^ 

i=ua  i=va  j=  1 

K  ’  7  =  1 


s.t.  y3  -  (C,h(x3))  -  Ui 

< 

Vij,  i  =  ua,...,u-  1  ,j  =  1,.. 

.,v 

0 

< 

Vij,  i  =  va,. . .  ,v  -  1  ,j  =  1, .. 

■,V 

y 3  - 

(C,h(x3)) 

< 

W,  j  =  l,...,u 

c 

G 

Mm 

u  =  (uVa, 

•  •  • ,  Up- i) 

G 

Rv-v° 

ii 

•  • ,  Vp-l)V) 

G 

jj^(u-ua)u 

w 

G 

R. 

This  equivalent  reformulation  of  Du  involves  m  +  (v  —  zza)(i/  +  1)  +  1  variables  and  2 (v  —  va)l/  +  u 
inequality  constraints.  While  va  =  \vot\  may  be  relatively  close  to  v  in  practice,  the  linear  program 
could  become  large-scaled  when  u  is  large  and  decomposition  algorithms  may  be  needed. 

Alternatively,  we  consider  next  a  numerical  integration-based  scheme  that  avoids  some  auxil¬ 
iary  variables  and  constraints,  and  also  handles  the  situation  when  the  distribution  of  (Xv ,  Yu)  is 
not  uniformly  discrete. 
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5.2  Numerical  Integration 

The  integral  in  Du  is  easily  approximated  by  standard  numerical  integration  schemes.  Suppose 
that  the  interval  [a,  1]  is  divided  into  p  subintervals,  where  a  <  (3q  <  fd\  <  . . .  <  /3^_i  <  (3^  <  1 
and  Wi  >  0,  %  =  0, 1,  ...,p,  are  factors  specific  to  the  integration  scheme.  An  approximation  of  Dv 
then  takes  the  form 


:  VT7  E <"<9A(zo(C))  -  EK(C)] . 

O  (E  it  1  Ct 

2=0 

For  large  p,  an  optimal  solution  of  Du,fl  is  close  to  that  of  Du ,  as  seen  next,  under  conditions  that 
are  satisfied  by  essentially  all  commonly  used  numerical  integration  schemes. 


Proposition  6  Suppose  that  for  any  continuous  function  g  :  [a,  1]  — >•  R,  a  numerical  integration 
scheme  with  discretization  points  a  <  /3q  <  Pi  <  ■  ■  ■  <  ffi-i  <  <  1  and  factors  Wi  >  0,  i  = 


0, 1,  ...,11,  satisfies 


^  ^  r l 

^2  Wig{Pi)  -  / 

t=0  Ja 


g(/3)d/3 


as  p  — >•  oo.  Let  {Cv'^f^=1  he  a  sequence  of  optimal  solutions  of  under  this  numerical  integra¬ 
tion  scheme.  Then,  every  accumulation  point  of  is  an  optimal  solution  of  Du . 


Proof:  For  any  C  €  Rm,  <^(Zq(C'))  is  finite  and  continuous  as  a  function  of  /3.  Consequently, 
the  assumption  on  the  numerical  integration  scheme  applies  and  the  objective  function  of  D1’^ 
converges  pointwise  to  that  of  Du,  as  p  — >  oo.  The  objective  functions  are  also  finite  and  convex  in 
C,  which  follows  directly  from  the  convexity  of  qa  on  and  the  affine  form  of  Z'f  as  a  function 

of  C.  Consequently,  by  Theorem  7.17  in  [23],  the  objective  function  of  Dv,tl  epiconverges  to  that 
of  Du  and  the  conclusion  follows  from  Theorem  7.31  in  [23].  □ 


While  specialized  solvers  such  as  Portfolio  Safeguard  [1]  handle  Du'^  directly  with  little  dif¬ 
ficulty  under  many  circumstances,  the  problem  is  typically  nonsmooth  and  standard  nonlinear 
programming  solvers  may  fail.  However,  following  a  simple  reformulation  of  Du,fJ’,  utilizing  (5), 
yields  the  following  equivalent  linear  program,  where  we  assume  for  convenience  that  <  1: 


1 

min  - 

c,u,v  1  -  a 


i= o 


-{C,h{xj))) 

3= 1 


s.t.  yj  -  (C,h(xj))  -  Ui 

0 

c 

U  =  (U0,U1,...,Uli) 
V  =  (Vo,i,  V^v) 


<  Vij,  7  =  0,1,  ...,n  ,  j  =  1, ...,  v 

<  Vij:  i  =  0, 1,  ...,n  ,  j  =  1,  ...,v 
€  Rm 

€  R^+1 
G  R^+1)u. 


If  (ip  =  1,  then  a  straightforward  modification  is  required  based  on  the  fact  that  q\ (Z'f(C))  = 
maxj=ii2r..iV  yi  —  (C,  x J").  The  linear  program  consists  of  m+p  + 1  -\-v(p+ 1)  variables  and  2i/(p  +  l) 
constraints,  which  may  be  substantially  less  than  what  follows  from  the  analytical  integration 
approach  for  large  v.  In  practice,  we  find  that  a  moderately  large  p  suffices  as  shown  in  the  next 
section. 
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6  Numerical  Examples 


In  this  section,  we  illustrate  superquantile  regression  in  three  numerical  examples.  The  first  ex¬ 
ample  is  artificially  constructed,  with  known  conditional  superquantiles.  The  second  example  is 
an  instance  from  the  uncertainty  quantification  literature.  The  third  example  arises  in  investment 
analysis.  Computations  are  mostly  carried  out  in  Matlab  version  7.14  on  a  2.26  GHz  laptop  with 
8.0  GB  of  RAM  using  Portfolio  Safeguard  [1]  with  VAN  as  the  optimization  solver  for  DU,IX.  When 
solving  Dulp  we  employ  GAMS  version  23.7  with  the  CPLEX  12.3  solver  on  a  4.0  GB,  2.50  GHz 
laptop. 

6.1  Example  1:  Solutions  Methods  and  Tracking 

We  start  by  considering  a  loss  random  variable 

Y  =  X i  +  X2e,  almost  surely, 

where  e  is  a  standard  normal  random  variable  and  X  =  (X\,  X^)  is  uniformly  distributed  on 
[—1,1]  x  [0,1],  with  e, X\,  and  X2  independent.  We  consider  a  regression  function  of  the  form 
f(x)  =  Co  +  C\x\  +  C2X2  and  set  a  =  0.90. 

We  first  examine  the  computational  effort  required  to  obtain  an  approximate  regression  vector. 
Table  1  shows  computing  times  for  solving  DULP  for  increasingly  larger  sample  sizes  v  obtained  by 
independent  draws  from  (e,  X\,  X2).  While  the  results  correspond  to  single  instances  of  Dpp,  the 
times  vary  little  between  two  samples  of  the  same  size  and  the  computing  times  are  therefore 
representative.  As  expected  from  the  discussion  at  the  end  of  Section  5.1,  the  computing  time 
grows  quickly  as  the  sample  size  v  increases.  In  addition  to  the  inconvenience  of  long  computing 
times,  memory  requirements  become  problematic.  DVLP  has  a  special  structure  and  we  anticipate 
significant  reduction  in  computing  times  and  memory  needs  resulting  from  tailored  algorithms. 
However,  the  development  of  such  algorithms  is  beyond  the  scope  of  the  paper. 


V 

100 

200 

300 

400 

500 

600 

700 

800 

900 

1000 

1500 

2000 

Time 

0 

0 

2 

6 

17 

32 

45 

65 

163 

174 

996 

2972 

Table  1:  Computing  times  (sec.)  to  solve  DPP  for  increasing  sample  size  in  Example  1. 

Second,  we  consider  the  alternative  approach  based  on  solving  While  this  approach 

introduces  a  numerical  integration  error,  Proposition  6  indicates  that  the  error  is  negligible  for 
large  /i.  In  fact,  as  we  see  next  empirically,  moderately  large  /i  suffices.  Moreover,  the  substantial 
reduction  in  problem  size,  as  compared  to  that  of  Dpp,  reduces  computing  times  dramatically. 

Since  Qb{Zq  (C))  may  be  nonsmooth  as  a  function  of  (3,  standard  numerical  integration  error 
bounds  may  not  apply.  However,  since  qp{ZQ  (C))  is  continuous  and  nondecreasing  as  a  function  of 
/ 3 ,  the  use  of  left-endpoint  and  right-endpoint  numerical  integration  rules  in  Du ,#i  provide  lower  and 
upper  bounds  on  the  optimal  value  of  Du,  respectively.  Table  2  shows  solution  vectors  (Co,  Ci,  C2 ) 
for  /i  =  100,  /i  =  1000,  left-endpoint,  right-endpoint,  and  Simpson’s  numerical  integration  rules, 
and  sample  sizes  of  v  =  100  and  v  =  10000.  Each  solution  of  Dv,,x  is  obtained  quickly,  in  about 
0.5  and  5  seconds  for  v  =  100  and  v  =  10000,  respectively;  see  the  last  column  of  Table  2.  We 
also  show  the  corresponding  coefficient  of  determination  R \  for  each  instance.  For  v  =  100,  the 
solutions  and  R \  are  insensitive  to  the  numerical  integration  rule  as  well  as  /U.  The  obtained 
solutions  are  essentially  identical  to  the  regression  vector  obtained  from  DVLP)  see  Row  8  of  Table 
2.  For  n  =  10000,  we  note  some  differences  but  magnitudes  are  small.  In  this  case,  we  are  unable 
to  solve  DuLp  due  to  its  size.  We  observe  that  as  indicated  by  the  coefficients  of  determination,  the 
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Rule 

V 

Co 

Ci 

c2 

d2 

-^o^o 

Time 

Left  Endpoint 

100 

100 

0.0630 

1.0951 

1.5841 

0.568 

0.07 

Left  Endpoint 

100 

1000 

0.0630 

1.0951 

1.5841 

0.568 

0.79 

Right  Endpoint 

100 

100 

0.0630 

1.0951 

1.5841 

0.568 

0.08 

Right  Endpoint 

100 

1000 

0.0630 

1.0951 

1.5841 

0.568 

0.83 

Simpson’s 

100 

100 

0.0630 

1.0951 

1.5841 

0.568 

0.09 

Simpson’s 

100 

1000 

0.0630 

1.0951 

1.5841 

0.568 

0.77 

Analytic 

100 

NA 

0.0630 

1.0951 

1.5841 

0.568 

0.05 

Left  Endpoint 

10000 

100 

0.0835 

1.0049 

1.6374 

0.392 

0.58 

Left  Endpoint 

10000 

1000 

0.0820 

1.0048 

1.6423 

0.392 

5.91 

Right  Endpoint 

10000 

100 

0.0799 

1.0050 

1.6492 

0.392 

0.56 

Right  Endpoint 

10000 

1000 

0.0816 

1.0048 

1.6435 

0.392 

5.00 

Simpson’s 

10000 

100 

0.0818 

1.0048 

1.6429 

0.392 

0.56 

Simpson’s 

10000 

1000 

0.0818 

1.0048 

1.6430 

0.392 

5.27 

Table  2:  Solution  vectors,  coefficients  of  determination,  and  computing  times  (sec.)  for  Example  1 
with  varying  integration  rule  as  well  as  number  of  intervals  /j  and  observations  u. 


linear  model  /(x)  =  Co  +  C\x\  +  C2X2  doesn’t  fully  capture  the  variability  of  the  data  and  a  study 
of  other  models  may  be  warranted.  However,  we  omit  such  an  investigation  and  instead  turn  to 
superquantile  tracking. 

Third,  we  examine  conditional  values  of  Y  given  realizations  of  X  =  (X\ ,  X2),  i.e.,  superquan¬ 
tile  tracking.  For  x  =  (xi,  X2),  Y (x)  =  Y\X  =  x  is  normally  distributed  with  mean  x\  and  variance 
x'2  •  Consequently,  it  is  straightforward  to  compute  that 

qo.g(Y(x))  =  x\  +  1.7550x2- 

Table  2  shows  vectors  that  only  track  qg_g(Y(-))  approximately,  as  Co,  C\,  and  C2  deviate  from  0, 
1,  and  1.755,  respectively.  In  fact,  there  is  in  general  no  guarantee  that  every  regression  function 
/  will  satisfy  f(x)  =  qa{Y{x))  for  all  x,  even  for  large  sample  sizes.  As  indicated  by  Proposition 
5,  however,  a  superquantile  of  Y (x)  can  be  estimated  by  approximating  a  degenerate  distribution 
of  (X,  Y)  at  x.  Table  3  shows  such  ‘local’  estimates  of  qg$(Y(x))  near  x  =  (0.5, 0.5).  Specifically, 
using  v  =  500  we  compute  Co,  Ci,  and  C2  by  solving  DvLp  as  above,  with  X  sampled  uniformly 
from  [—1, 1]  x  [0, 1].  We  repeat  these  calculations  10  times  with  independent  samples  and  obtain  the 
aggregated  statistics  of  Column  2  of  Table  3.  The  second  row  gives  an  approximate  95%  confidence 
interval  for  the  mean  value  of  Co  +  0.5Ci  +  O.5C2  across  the  10  nreta-replications.  The  interval 
contains  <7o.9(T((0.5,  0.5)))  =  1.3775,  but  is  somewhat  wide.  Proposition  5  indicates  that  sampling 
from  a  smaller  set  [0.45,0.55]  x  [0.45,0.55]  will  tend  to  improve  the  estimate  of  <7o.g(T((0.5,  0.5))). 
Column  3  of  Table  3  illustrates  this  effect,  by  showing  results  comparable  to  those  of  Column  2  and 
Row  2,  but  for  the  smaller  interval.  As  expected,  the  confidence  interval  for  Co  +  0.5Ci  +  O.5C2 
narrows  around  the  correct  value.  The  last  column  shows  similar  results,  but  now  for  sampling 
of  X  uniformly  on  [0.495,0.505]  x  [0.495,0.505].  The  estimate  of  qo.g(Y((0.5, 0.5)))  improves  only 
marginally,  with  the  residual  uncertainty  being  due  to  the  inherent  variability  in  the  (relatively 
small)  samples.  The  narrow  sampling  interval  causes  the  last  estimate  to  be  similar  to  that  obtained 
by  the  standard  empirical  estimate  from  500  realization  of  Y ((0.5,  0.5)),  which  yields  the  confidence 
interval  (1.312,1.462). 

While  sampling  on  smaller  sets  gives  better  local  estimates  of  qo.g(Y(x)),  the  global  picture 
deteriorates.  The  last  three  rows  of  Table  3  show  corresponding  approximate  95%  confidence 
intervals  for  Co,  Ci,  and  C2,  respectively.  While  C0+C1X1+C2X2  generated  by  the  set  [—1, 1]  x  [0, 1] 
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provides  a  reasonably  good  global  picture  of  qo.g(Y(x)),  the  smaller  sets  lose  that  quality  as  seen 
from  the  wide  confidence  intervals.  In  view  of  the  above  results,  we  see  that  an  analyst  that  can 
choose  “design  points,”  i.e.,  points  x  at  which  to  sample  Y(x),  should  balance  the  need  for  accurate 
local  estimates  with  that  of  global  estimates.  In  fact,  even  if  the  primary  focus  is  on  estimating 
qa(Y (x))  for  a  given  x,  as  we  see  in  this  example,  it  may  be  equally  effective  to  spread  the  samples 
of  A  near  x  instead  of  exactly  at  x,  and  then  obtain  some  global  information  about  qa(Y(-))  too. 
Our  methodology  provides  a  flexible  framework  for  estimating  qa(Y (x))  even  if  there  is  only  a  small 
number  of  realization  of  Y(x),  or  even  none,  available.  The  estimates  are  based  on  realization  of 
Y ( x ')  for  x'  near  x.  None  of  the  numerical  examples  in  this  paper  include  data  with  more  than  one 
realization  of  Y (x)  for  any  x. 


A  range: 

[-1,1]  x  [0,1] 

[0.45,  0.55]2 

[0.495, 0.505]2 

C0  +  0.5Ci  +  0.5C2 

(1.349,  1.575) 

(1.329,  1.475) 

(1.330,  1.473) 

Co 

(0.029,  0.123) 

(-2.414,  1.784) 

(-23.715,  18.329) 

Ci 

(0.971,  1.075) 

(-0.229,  3.597) 

(-11.063,  25.656) 

c2 

(1.523,  1.975) 

(-1.686,  5.186) 

(-33.916,  35.701) 

Table  3:  Approximate  95%  confidence  intervals  when  tracking  <70. g(U(-))  in  Example  1  near  x  = 
(0.5, 0.5)  using  shrinking  sampling  ranges  for  X.  The  correct  value  c/o.9(E((0.5, 0.5)))  =  1.378. 


6.2  Example  2:  Uncertainty  Quantification 

The  next  example  arises  in  uncertainty  quantification  of  a  rectangular  cross  section  of  a  structural 
column  under  uncertain  material  properties  and  uncertain  loads;  see  [8]  for  details.  The  performance 
of  the  column  is  described  by  the  random  variable 

Y  =  _1  +  wo ek  +  ^kq’ almost  surely'  (40) 

where  the  moment  load  X\  and  the  axial  load  A2  are  normally  distributed  with  mean  2000  and 
standard  deviation  400,  and  mean  500  and  standard  deviation  100,  respectively,  and  the  material 
strength  A3  is  lognormally  distributed  with  parameters  5  and  0.5,  with  X\,  X2,  and  A3  independent. 
(We  note  that  the  orientation  of  the  performance  random  variable  is  switched  compared  to  that  of 
[8]  for  consistency  with  our  focus  on  ‘losses’  instead  of  ‘gains.’)  We  set  the  width  w  =  3,  and  the 
depth  d  =  12. 

We  seek  to  quantify  the  ‘uncertainty’  in  Y  by  surrogate  estimation.  Of  course,  in  this  case, 
this  is  hardly  necessary;  direct  use  of  (40)  suffices.  However,  in  practice,  an  analytic  expression 
for  a  system  performance,  as  in  (40),  is  rarely  available.  One  then  proceeds  with  determining  a 
regression  function  /  :  1R 3  — >•  1R,  based  on  a  sample  of  input-output  realizations,  such  that  /(A), 
with  A  =  ( Ai ,  A2 ,  A3) ,  approximates  Y  in  some  sense.  To  mimic  this  situation,  we  consider  a 
sample  of  size  50000  drawn  independently  from  A,  the  corresponding  realizations  of  Y  according 
to  (40),  and  two  forms  of  the  regression  function.  The  first  model  is  linear  and  takes  the  form 

fi{x)  =  Co  +  C1X1  +  C2X2  +  C3X3 

and  the  second  one  utilizes  basis  functions  h\(x)  =  x  1  / X3  and  h 2(x)  =  (X2/X3)2  and  is  of  the  form 

f2  (x)  =  C0  +  Cixi/x3  +  C2x|/x§. 

In  view  of  (40),  we  expect  f\  to  be  unable  to  capture  interaction  effects  between  variables  and  its 
explanatory  power  may  be  limited.  In  contrast,  /2  uses  the  correct  basis  functions,  but  even  then 
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Model 

a 

Co 

102Ci 

104C2 

io4c3 

Rl 

h 

0.999 

-0.6797 

0.0156 

7.9000 

-9.1100 

0.154 

h 

0.99 

-0.8084 

0.0150 

3.8000 

-8.2700 

0.190 

h 

0.9 

-0.8579 

0.0107 

1.5900 

-7.7000 

0.260 

h 

0.75 

-0.8705 

0.0090 

1.0800 

-7.5900 

0.301 

h 

LS 

-0.8827 

0.0070 

0.5921 

-7.7180 

0.571* 

h 

0.999 

-1.0457 

1.8640 

0.0300 

NA 

0.902 

h 

0.99 

-1.0450 

1.6182 

0.0400 

NA 

0.891 

h 

0.9 

-1.0308 

1.3393 

0.0200 

NA 

0.894 

h 

0.75 

-1.0261 

1.2595 

0.0200 

NA 

0.893 

h 

LS 

-1.0179 

1.1315 

0.0056 

NA 

0.979* 

Table  4:  Approximate  regression  vectors  and  coefficients  of  determination  in  Example  2  for  varying 
a  and  least-squares  (LS)  regression.  An  asterisk  indicates  that  the  coefficient  of  determination  is 
determined  by  (34). 


/2(A)  may  deviate  from  Y  due  to  the  finite  sample  size  used  to  determine  the  regression  vector. 
Table  4  confirms  this  intuition  by  showing  approximate  regression  vectors  for  both  models  over 
a  range  of  probability  levels  a  as  well  as  for  the  least-squares  (LS)  regression.  The  vectors  are 
obtained  in  less  than  15  seconds  by  solving  Du ,M,  with  v  =  50000,  fi  =  1000,  and  Simpson’s  rule. 
The  last  column  of  Table  4  shows  R2  (classical  coefficient  of  determination  according  to  (34)  in  the 
case  of  least-squares  regression),  which  is  low  for  f\  and  high  for  as  expected. 

In  uncertainty  quantification  and  elsewhere,  surrogate  estimates  such  as  fi(X)  and  /2(A)  are 
important  input  to  further  analysis  and  simulation.  Table  5  illustrates  the  quality  of  these  surrogate 
estimates  in  this  regard  by  showing  various  statistics  of  /i(A)  and  /2(A)  as  compared  to  those  of 
Y.  Row  2,  Columns  3-10  show  estimated  mean,  standard  deviation,  superquantiles  at  0.75,  0.9, 
0.99,  0.999,  probability  of  failure,  and  buffered  probability  of  failure  (see  (2))  of  A,  respectively, 
using  a  sample  size  of  107  and  standard  estimators.  Coefficients  of  variation  for  these  estimators  are 
ranging,  approximately,  from  10-5  for  the  mean  to  0.02  for  the  probability  of  failure.  Rows  3-6  of 
Table  5  show  similar  results,  using  the  same  sample,  for  /i(A),  with  a  =  0.999,  0.99,  0.9,  and  0.75, 
respectively.  We  notice  that  as  a  increases,  /i(A)  becomes  increasingly  conservative.  In  fact,  for 
a  =  0.999,  / i(A)  is  conservative  in  all  statistics.  Superquantile  regression  with  smaller  a  fails  to  be 
conservative  for  some  ‘upper-tail’  statistics.  Interestingly,  /i(A)  based  on  a  is  conservative  for  all 
superquantiles  up  to  and  including  qa  in  these  tests.  These  observations  indicate  that  in  surrogate 
estimation  the  probability  level  a  should  be  selected  in  accordance  with  the  superquantile  statistic  of 
interest.  We  can  then  expect  to  obtain  conserve  estimates  even  for  relatively  poor  surrogates.  Row 
7  of  Table  5  gives  corresponding  results  for  /i(A)  under  the  least-squares  regression  fit.  While  this 
fit  provides  an  accurate  estimate  of  the  mean  (see  Column  3),  the  upper-tail  behavior  is  represented 
in  a  nonconservative  manner. 

Rows  8-12  of  Table  5  show  comparable  results  to  those  above,  but  for  the  /2(A)  models.  As 
also  indicated  in  Table  4,  /2(A)  is  a  much  better  surrogate  of  Y  than  / i(A)  and  essentially  all 
quantities  improve  in  accuracy.  For  example,  /2(A)  based  on  superquantile  regression  overesti¬ 
mates  the  buffered  failure  probability  only  moderately  with  a  =  0.999,  0.99,  and  0.9,  and  slightly 
underestimate  with  a  =  0.75;  see  the  last  column  of  Table  5.  In  contrast,  least-squares  regression 
underestimates  the  buffered  failure  probability  substantially  even  for  this  supposedly  ‘accurate’ 
model.  Of  course,  least-squares  regression  centers  on  conditional  expectations  and  as  basis  for 
estimating  tail  behavior  may  hide  potentially  dangerous  risks. 
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Model 

a 

0 

<70.75 

<70.9 

<70.99 

<70.999 

103p 

103p 

Y 

NA 

-0.8436 

0.0996 

-0.7113 

-0.6211 

-0.3501 

0.0091 

0.3575 

1.052 

fi(X) 

0.999 

-0.1259 

0.1297 

0.0305 

0.0856 

0.1868 

0.2635 

158.1838 

376.995 

h{X) 

0.99 

-0.4575 

0.1027 

-0.3370 

-0.2963 

-0.2225 

-0.1669 

0 

0 

fi(X) 

0.9 

-0.6940 

0.0828 

-0.6016 

-0.5728 

-0.5219 

-0.4843 

0 

0 

MX) 

0.75 

-0.7641 

0.0777 

-0.6795 

-0.6544 

-0.6106 

-0.5786 

0 

0 

MX) 

LS 

-0.8439 

0.0748 

-0.7653 

-0.7439 

-0.7077 

-0.6819 

0 

0 

f2(X) 

0.999 

-0.7611 

0.1647 

-0.5381 

-0.3961 

-0.0053 

0.44953 

3.4410 

9.713 

f2(X) 

0.99 

-0.7979 

0.1431 

-0.6042 

-0.4808 

-0.1413 

0.25383 

1.4909 

4.206 

f2(X) 

0.9 

-0.8263 

0.1184 

-0.6660 

-0.5640 

-0.2831 

0.04375 

0.4702 

1.332 

f2(X) 

0.75 

-0.8337 

0.1113 

-0.6830 

-0.5870 

-0.3229 

-0.0155 

0.3194 

0.899 

f2(X) 

LS 

-0.8451 

0.1000 

-0.7097 

-0.6235 

-0.3864 

-0.1104 

0.1539 

0.440 

Table  5:  Statistics  of  J\  (X)  and  f2(X)  in  Example  2  as  compared  to  those  of  Y.  Columns  3-10 
show  mean,  standard  deviation,  superquantiles  at  0.75,  0.9,  0.99,  0.999,  probability  of  failure,  and 
buffered  probability  of  failure,  respectively. 


6.3  Example  3:  Investment  Analysis 

The  last  example  is  a  case  study  taken  from  the  “Style  Classification  with  Quantile  Regression” 
documentation  in  Portfolio  Safeguard  [1]  and  deals  with  the  negative  return  of  the  Fidelity  Magellan 
Fund  as  predicted  by  the  explanatory  variables  Russell  1000  Growth  Index  (Xi,  RLG),  Russell  1000 
Value  Index  ( X2 ,  RLV),  Russell  Value  Index  (X3,  RUJ),  and  Russell  2000  Growth  Index  (X4.  RUO). 
(We  change  the  orientation  from  ‘return’  to  ‘negative  return’  to  be  consistent  with  the  orientation 
of  a  loss  random  variable  in  the  present  paper.)  The  indices  classify  the  style  of  the  fund;  see  [1] 
for  details.  There  are  v  =  1264  total  observations  available. 


Regression 

a 

Co 

Ci  (RLG) 

C2  (RLV) 

C3  (RUJ) 

C4  (RUO) 

Rl 

Least-squares 

NA 

0.0010 

-0.5089 

-0.5180 

0.0484 

0.0061 

0.9824* 

Quantile 

0.75 

0.0045 

-0.5438 

-0.4518 

0.0159 

0.0173 

— 

Superquantile 

0.75 

0.0095 

-0.5036 

-0.4723 

0.0192 

0.0009 

0.8735 

Quantile 

0.90 

0.0089 

-0.5177 

-0.4602 

0.0156 

-0.0001 

— 

Superquantile 

0.90 

0.0138 

-0.4837 

-0.4912 

0.0223 

-0.0019 

0.8722 

Table  6:  Approximate  regression  vectors  and  coefficients  of  determination  in  Example  3  for  model 
fi.  An  asterisk  indicates  that  coefficient  of  determination  is  determined  by  (34). 

We  start  by  considering  a  linear  model  fi(x)  =  C0+C1X1+C2X2+C3X3+C4X4  and  compare  the 
obtained  approximate  regression  vectors  for  least-squares,  quantile,  and  superquantile  regression 
under  a  =  0.75  and  0.90,  as  shown  in  Table  6.  Du  is  solved  through  DU,IX  with  Simpson’s  rule  and 
[i  =  1000,  while  quantile  regression  is  carried  out  directly  in  Portfolio  Safeguard’s  Shell  Environment 
[1].  Table  6  also  shows  the  coefficients  of  determination,  where  for  least-squares  regression  we  use 
(34).  The  fits  are  good  and  a  majority  of  the  variability  in  the  data  is  captured.  However,  the 
small  values  of  C4  and  also  the  corresponding  p-value  from  the  least-squares  regression  point  to 
the  possible  merit  of  dropping  X4  (RUO)  as  explanatory  variable.  We  from  now  on  focus  on 
superquantile  regression.  A  new  model  f2(x)  =  Cq  +  C\X\  +  C2X2  +  C3X3  yields  the  approximate 
regression  vectors  of  Table  7,  which  also  shows  the  obtained  adjusted  coefficients  of  determination 
B?a  Ad- .  The  switch  from  R2a  to  R2a  Ad ■  enable  us  to  better  compare  fits  across  models  with  different 
number  of  explanatory  variables.  In  comparison,  adjusted  coefficients  of  determination  for  f\,  with 
a  =  0.75  and  0.90,  are  0.8732  and  0.8719,  respectively.  Consequently,  the  fit  improves  slightly  by 
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dropping  X4  (RUO). 


Regression 

a 

Co 

Ci  (RLG) 

C2  (RLV) 

C3  (RUJ) 

d  2 

^a.Adj 

Superquantile 

0.75 

0.0095 

-0.5028 

-0.4728 

0.0200 

0.8733 

Superquantile 

0.90 

0.0138 

-0.4855 

-0.4906 

0.0210 

0.8720 

Table  7:  Approximate  regression  vectors  and  adjusted  coefficients  of  determination  in  Example  3 
for  model  f2  ■ 

We  further  reduce  the  model  to  a  single  explanatory  variable  and  examine  the  four  possibilities 
in  Table  8.  We  find  that  A(])  deteriorates,  but  only  moderately  for  the  model  Co  +  C\X\.  This 
simple  model  captures  much  of  the  variability  in  the  data  set.  A  somewhat  poorer  fit  is  achieved  by 
X2  (RLV),  which  is  illustrated  in  Figure  1  for  a  =  0.90.  That  figure  also  depicts  the  corresponding 
quantile  and  least-squares  regression  lines.  It’s  apparent  that  superquantile  regression  provides  a 
distinct  perspective  from  the  other  regression  techniques  of  potential  significant  value  to  a  decision 
maker. 


Model 

a 

Co 

Ci  (RLG) 

C2  (RLV) 

C3  (RUJ) 

C4  (RUO) 

f>2 

Jri'a,Adj 

Co  +  C\X\ 

0.75 

0.0137 

-0.8228 

— 

— 

— 

0.7380 

Co  +  C\X\ 

0.90 

0.0218 

-0.8189 

— 

— 

— 

0.7248 

Co  +  C2X2 

0.75 

0.0321 

— 

-1.0668 

— 

— 

0.5940 

Co  +  C2X2 

0.90 

0.0475 

— 

-1.0727 

— 

— 

0.5702 

Co  +  C3X3 

0.75 

0.0515 

— 

— 

-0.7745 

— 

0.4103 

Co  +  C3X3 

0.90 

0.0714 

— 

— 

-0.6949 

— 

0.4162 

Co  +  C4X4 

0.75 

0.0344 

— 

— 

— 

-0.5498 

0.3962 

Co  +  C4X4 

0.90 

0.0512 

— 

— 

— 

-0.5145 

0.2593 

Table  8:  Approximate  regression  vectors  and  adjusted  coefficients  of  determination  in  Example  3 
for  superquantile  regression  with  single-variable  models. 


7  Conclusions 

The  paper  presents  a  superquantile  regression  methodology  centered  on  the  minimization  of  a  mea¬ 
sure  of  error  analogous  to  classical  least-squares  and  quantile  regression.  We  establish  the  existence 
of  a  regression  function,  discuss  its  possible  uniqueness,  and  its  stability  under  perturbation,  for 
example  caused  by  sample  approximations  of  a  true  distribution.  A  new  coefficient  of  determina¬ 
tion  allows  us  to  quantify  the  goodness  of  fit.  We  show  that  superquantile  regression  requires  the 
solution  of  a  linear  program,  as  in  the  case  of  quantile  regression,  or  alternatively  of  an  optimiza¬ 
tion  problem  with  superquantile  (conditional  value-at-risk)  constraints.  Our  computational  tests 
demonstrate  that  superquantile  regression  is  computationally  tractable,  provides  new  insight  about 
tail-behavior  for  quantities  of  interest,  and  offers  a  complementary  tool  for  the  risk-averse  decision 
maker. 
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Figure  1:  Regression  lines  in  Example  3  for  model  Co  +  C2X2. 
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